ABSTRACT


The techniques of decision theory are applied to the problem of
constructing  machines  that  improve  their  ability  to  recognize  patterns  by 
extracting  pertinent  information  from  a  previously  unclassified  sequence 
of  observations;  such  machines  are  said  to  learn  without  a  teacher. 

A  general  system  solution  is  obtained  which  includes  the  solutions 
to  the  problems  of  learning  without  a  teacher,  learning  with  a  teacher, 
and  no  learning.  The  solution  has  been  extended  to  include  problems  in 
which  the  unknown  parameter  is  time  varying,  as  well  as  problems  in 
which  the  probabilities  of  occurrence  of  the  classes  are  unknown 
a  priori  and  must  be  learned.  The  resulting  systems  are  shown  to  be 
stable  and  to  have  performance  which  converges  to  the  performance  of 
systems  which  have  a  priori  knowledge  of  the  unknown  parameters  being 
learned.  It  has  been  demonstrated  that  for  most  cases  either  the  optimum 
system,  or  a  suboptimum  system  which  performs  within  an  arbitrarily  small 
tolerance  of  the  optimum  system,  is  realizable  in  the  sense  that  it 
requires  a  finite  memory. 

The  techniques  of  this  paper  are  applied  to  examples  of  learning 
problems  in  the  communications,  radar,  and  electromagnetic  reconnaissance 
fields.
SYMBOLS


a  the  amplitude  of  a  narrowband  signal,  a  scalar  random 

variable 

A  the  parameter  of  a  Rayleigh  distribution 

(A  )  a  set  of  vector-valued  unknown  parameters 

$A_k$     the event $P(\hat{\theta} \mid \lambda_k) > P(\theta_i \mid \lambda_k)$ for all $\theta_i \neq \hat{\theta}$
          (see Chapter III)

b(t)      a known waveform of finite time-bandwidth product

B         a known column vector with the sample values of b(t) as elements

$B_k$     the event $P(\hat{\theta} \mid \lambda_k) \leq P(\theta_i \mid \lambda_k)$ for some $\theta_i \neq \hat{\theta}$

c         an unknown scalar parameter; also used as an index to indicate
          "current value of"

          a constant $= A^2/(A^2 R^2 + 1)$

d         divergence (see Chapter III)

d(·)      a decision rule

D         a constant $= 1/S_n(f_i)$

$E(f_i)$  a column vector with the $m$th element $= \exp(j 2\pi f_i m \Delta)$

E{·}      the expectation of the quantity within braces

$f, f_i$  the frequency of a sinusoid

f(·)      a function used to factor a statistic

$g, g_k$  the gain, or attenuation, of a randomly time-varying communication
          channel. The index indicates the value is to be taken at a
          particular time, kT.

g(·)      a function used to factor a statistic

G         a complex constant used to represent the operation of a fading
          channel on a signal. The modulus of G is the channel gain and
          the argument of G is the channel phase shift.

h(·)      a function used to factor a statistic
$h(t,\gamma)$  the response of a time-varying linear filter at time t to an
          impulse applied at time $\gamma$

$H_i$     the $i$th hypothesis

i         an integral-valued index used to distinguish members of a set

j         $= \sqrt{-1}$; also used as an integer

$J(f_i)$  a column vector with the $m$th element $= \cos(2\pi f_i m \Delta)$

k         an integral-valued index used as a time index; e.g., $g_k$ is the
          value of g observed at time kT

K         covariance matrix

$\ell_k$  $= \ell(X_k \mid \lambda_{k-1})$, the $k$th value of the likelihood ratio
          conditioned on the past

$\ell(\cdot)$  the likelihood ratio

L         the loss associated with a false alarm relative to the loss
          associated with a miss when the loss associated with a correct
          decision is zero

m         an integral-valued index

M         the number of possible classes

$M_c$     the memory capacity required of a learning system

n(t)      a noise waveform

N         the column vector with samples of n(t) as elements

$p_1, p_2$  the a priori probability of occurrence of hypotheses 1 and 2
          respectively

p(·)      a probability density distribution

P(·)      a cumulative probability distribution

Pr{·}     the probability that the indicated event will occur.
          Occasionally a subscript is used to indicate the event, such as
          $P_{FA}$, which is the probability of a false alarm occurring.

q         an integral-valued index used to distinguish different possible
          values of the unknown parameter
Q         the number of possible values of the unknown parameter

          the limits of the range of a

R         a generalized signal-to-noise ratio

s(t)      a signal waveform of finite time-bandwidth product (may be
          lowpass or bandpass)

S         the column vector with samples of s(t) as elements

$S_n(f)$  the noise spectral density

t         the time variable; also used as a subscript on vectors to
          indicate "transpose"

T         a constant, the period of one observation; the duration of a
          signal waveform

T(·)      a function called a statistic

W         the signal bandwidth if the center frequency of the signal is
          known; the range of the center frequency if it is unknown

$\{w_i\}$ a set of weights on the taps of a delay line used to synthesize
          a time-varying linear filter

x(t)      the observed, or received, waveform which is to be classified

X         the column vector with samples of x(t) as elements

          -  *i  -  Vi

          a binary random variable

          a dummy variable

$Z^{(k)}$ an ordered k-tuple with binary-valued components

          $= p_2/p_1$, the ratio of a priori probabilities

$\gamma$  $= L p_2/p_1$, the threshold; in one instance $\gamma$ is used as a
          dummy time variable

$\Gamma(\cdot)$  a dummy function used to obtain performance bounds

$\Delta$  $= 1/(2W)$, the sampling interval

          $= \theta_k - \theta_{k-1}$, the $k$th perturbation in the unknown parameter
$\epsilon$     a small quantity; subscripts are used to distinguish one small
          quantity from another as necessary

$\zeta(t)$  a complex, lowpass time waveform related to x(t) by the
          equation $x(t) = \mathrm{Re}\{\zeta(t) \exp(j\omega_0 t)\}$

$\eta(t)$  a complex, lowpass time waveform related to n(t) in the same
          way that $\zeta(t)$ is related to x(t)

$\theta$  the unknown parameter

$\lambda_k$  $= (X_1, X_2, \ldots, X_k)$, which is used as a shorthand notation to
          indicate that a probability density is conditioned by the
          values of the past k observations

$\mu(\cdot)$  a dummy function used to obtain performance bounds

$\nu(\cdot)$  a moment-generating function

$\rho, \rho(\cdot)$  the average risk, a performance measure

$\tau$    a time variable

$\phi$    a phase variable, used as the phase of a narrowband signal and
          as the argument of the complex channel parameter G

          a set of independent functions

$\Phi$    the set of all possible values of $\theta$

$\omega$  $= 2\pi f$, the radian frequency variable


OTHER  SYMBOLS 

|         indicates conditioning; e.g., $p(X \mid \theta)$ is the probability
          density of X conditioned on the value of $\theta$

^         indicates the true value of a parameter; e.g., $\hat{\theta}$ is the
          true value of $\theta$

indicates  the  real  part  of  the  associated  symbol;  e.g., 

l( 0  =  Re  t)} 

indicates  the  imaginary  part  of  the  associated  symbol;  e.g., 
1(0  =  'Til  Ol(t)l 

$\in$     indicates "is a member of"; e.g., $\theta \in \Phi$ means $\theta$ is a
          member of the set $\Phi$

$\oplus$  indicates a corruptive noise operation; e.g., $S \oplus N$
          indicates that a signal has been corrupted by the addition or
          multiplication of noise
I. INTRODUCTION


The  purpose  of  this  research  has  been  to  apply  the  techniques  of 
decision  theory  to  the  problem  of  constructing  optimal  machines  which 
improve  their  ability  to  classify  patterns  by  extracting  pertinent 
information  from  a  previously  unclassified  sequence  of  observations; 
such  machines  learn  without  a  teacher. 

In  recent  years  interest  in  classification  problems  and  in  machines 
to  automatically  solve  these  problems  has  been  intensified  by  the 
development  of  a  technology  in  which  such  problems  occur  more  and  more 
frequently  and  the  development  of  the  analytical  and  physical  tools  with 
which  to  solve  the  problems.  As  a  particular  example,  the  advent  of  the 
intercontinental  ballistic  missile  has  made  it  mandatory  that  surveillance 
systems  operate  as  rapidly  and  accurately  as  possible;  such  systems 
introduce  a  variety  of  classification  problems.  The  development  of 
high-speed large-capacity digital computers has made it possible to perform
extremely complex data processing in real time. It is anticipated
that  both  the  number  of  classification  problems  and  the  capability  of 
the  tools  to  solve  these  problems  will  increase  in  the  next  few  years. 


A.  CLASSIFICATION  PROBLEMS 

In order to be more precise in the meaning of "optimum" and "learning
without a teacher," it is necessary to define the classification problem
in decision-theory terms: Given an object and a set of classes from
which the object may have been drawn, determine the class from which the
object was drawn. To get a reasonable solution (by some criterion of
reasonableness) one must also be given some knowledge of the losses
which will be incurred if an improper determination is made.

In order to solve the problem some set of measurements must be
chosen. A particular set of measurements will be called an "observation"
and will be represented by a column vector X. (Each element of the
vector represents the measurement of a particular parameter, such as "the
amplitude of a voltage at time $t_0$" or "the color of the object." Thus
the specification of a set of measurements may be thought of as the
labeling of the coordinates of an observation space.) For the purposes
of this research it is assumed that the observation space is given (that
is, it is known which measurements to make), and the problem is to determine
a way to process the observations to make a classification decision
which is in some sense "optimum."

In order to come to a definition of "optimum," some information must
be given regarding the losses associated with misclassification. For
this  purpose  it  is  assumed  that  a  loss  function  which  provides  this 
information  is  given.  This  loss  function  depends  both  on  the  decision 
to  place  the  object  observed  in  a  particular  class  and  the  actual  class 
from  which  the  object  was  drawn.  Thus  there  is  a  risk  associated  with 
each  particular  decision. 

A  reasonable  definition  of  the  optimum  system  is  that  system  which 
minimizes the expected or average risk. Such a system is a realization
of  a  Bayes  decision  rule,  and  throughout  this  report  the  Bayes  system 
will  be  considered  to  be  optimum. 
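The minimum-average-risk rule described above can be sketched in a few lines. This is an illustration only: the function name, the loss matrix, and the class probabilities are hypothetical values chosen for the example and do not come from the report. Class index 0 stands for "signal present" and index 1 for "noise alone".

```python
import numpy as np

def bayes_decision(posterior, loss):
    """Return the decision with minimum expected loss, plus all expected losses.

    posterior[j] : probability that the object was drawn from class j
    loss[i, j]   : loss incurred by deciding class i when class j is true
    """
    posterior = np.asarray(posterior, dtype=float)
    loss = np.asarray(loss, dtype=float)
    expected_loss = loss @ posterior          # risk of each candidate decision
    return int(np.argmin(expected_loss)), expected_loss

# Hypothetical losses: a false alarm costs 1, a miss costs 4, correct decisions cost 0.
loss = np.array([[0.0, 1.0],    # decide "signal":  correct / false alarm
                 [4.0, 0.0]])   # decide "noise":   miss / correct
posterior = np.array([0.3, 0.7])  # assumed class probabilities given the observation

decision, expected_loss = bayes_decision(posterior, loss)
print(decision, expected_loss)
```

With these assumed numbers the rule decides "signal" even though "noise" is the more probable class, because a miss is taken to be four times as costly as a false alarm.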

When phrased in these terms, classification problems may be characterized
in terms of the probability measures induced on the observation
space by the different classes of objects. Thus if an object being
observed is a member of class 1 the observation X will have a cumulative
probability distribution,* say $P_1(X)$; if the object belongs to
class 2 the observation will have a different cumulative probability
distribution, say $P_2(X)$, etc. Three categories of decision problems
are possible:

1. The functional forms of the relevant probability measures may be
completely known.

2. The functional forms may be completely unknown.

3. The functional forms may be known except for some set of unknown
parameters.

* Throughout this report it is assumed that the cumulative probability
distributions are representable by probability density distributions;
e.g., for a scalar observation

\[ P(x) = \int_{-\infty}^{x} p(\alpha)\, d\alpha , \]

where the integral is taken in the Lebesgue-Stieltjes sense [Ref. 1]
and the class of functions $p(\alpha)$ includes the delta function defined
by Middleton [Ref. 2].

Problems which lie in the first category do not involve learning
because the solution is defined once the relevant probability measures
are known [Refs. 2, 3, 4]. Problems in the second category are commonly
referred to as nonparametric. Although there are many important problems
in this category (e.g., speech recognition, medical diagnosis, weather
prediction) which have been investigated with varying degrees of success,
no systematic analytical approach has been developed for such problems.
Since it will be assumed that the functional forms of the probability
measures are known except for some set of parameters, the techniques
developed here will be applicable only to the third, or parametric, class
of decision problems.

B.  DECISION  MACHINES  WHICH  LEARN 

A machine to solve the classification problem must be designed to
apply a decision rule to each observation. It seems clear that the decision
rule should depend upon how much is known about the problem prior
to the time at which the classification decision is to be made. If the
problem is a parametric one, it is characterized by a set of probability
measures depending on an unknown parameter, say $\{p_i(X \mid \theta);\ i = 1, 2, \ldots, M\}$,
where $\theta$ is the parameter. Suppose that this set is known, that an
observation is available, and that some a priori knowledge of the parameter
[represented by an a priori distribution $p_0(\theta)$] is given. Then
a decision rule may be found using standard techniques [Refs. 2, 3, 4].

If, in addition, a sequence of observations, $\{X_1, X_2, \ldots, X_k\}$, is
available, and if this sequence contains information concerning $\theta$, this
information may be extracted and used to modify the decision rule. This
may be accomplished by using the sequence of observations (which shall be
called a learning sequence and designated by $\lambda_k$) to compute a sequence
of conditional distributions of $\theta$:

\[ p_0(\theta) \rightarrow p(\theta \mid \lambda_1) \rightarrow p(\theta \mid \lambda_2) \rightarrow \cdots \rightarrow p(\theta \mid \lambda_k) \]
This sequence of distributions defines a sequence of decision rules, and
the resulting machine may be said to "learn."
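One step of this sequence can be written out with Bayes' rule. The form below is a sketch under an assumption that is not stated at this point in the text: given $\theta$, the new observation $X_k$ is independent of the earlier observations, so that $p(X_k \mid \theta, \lambda_{k-1}) = p(X_k \mid \theta)$. The symbol $\lambda_0$ for the empty learning sequence is introduced here only for convenience.

```latex
p(\theta \mid \lambda_k)
  = \frac{p(X_k \mid \theta)\, p(\theta \mid \lambda_{k-1})}
         {\int p(X_k \mid \theta')\, p(\theta' \mid \lambda_{k-1})\, d\theta'},
\qquad k = 1, 2, \ldots, \qquad p(\theta \mid \lambda_0) \equiv p_0(\theta).
```

Under that assumption each new observation multiplies the current distribution by $p(X_k \mid \theta)$ and renormalizes, so only the most recent distribution has to be carried forward from one decision to the next.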

It is desirable to make a distinction between two modes of learning.
A machine which learns with a teacher is provided with two pieces of
information: (1) a learning sequence and (2) the correct classification
of each member of the sequence. A machine which learns without a teacher
is  not  given  the  latter  information.  Thus  a  machine  which  learns  without 
a  teacher  may  utilize  only  that  information  which  is  available  prior  to 
receiving  the  first  observation  or  which  is  contained  in  the  learning 
sequence.  In  contradistinction,  a  machine  which  learns  with  a  teacher 
must  be  externally  aided. 

There  are  many  problems  in  which  our  external  means  of  classification 
is  either  poor  or  nonexistent.  If  the  machine  which  is  built  to  solve 
these  problems  must  make  repetitive  decisions,  then  sooner  or  later  an 
observation  sequence  will  become  available.  If  there  is  any  information 
in  one  observation  concerning  other  observations,  and  if  we  desire  a 
machine  which  takes  advantage  of  this  information,  then  we  require  a 
system  which  learns  without  a  teacher.  (The  nature  of  these  problems 
excludes  machines  which  must  be  trained.) 

There  are  also  problems  that  require  a  machine  which  continues  to 
improve  in  performance  after  it  has  been  placed  in  operation.  Included 
in  this  class  are  problems  in  which  the  characteristics  of  the  pattern 
to  be  recognized  are  changing  with  time.  A  machine  could  be  trained 
during operation only if the correct classification of each new observation
were known, but if this were the case we would not need the machine.

These  types  of  problems  provide  a  compelling  motivation  for  this 
research  which  is  concentrated  on  the  synthesis  of  machines  which  learn 
without  a  teacher. 

C. RELATED WORK

One  of  the  first  engineering  problems  which  led  to  development  of  a 
recognized  adaptive  system  was  that  of  communication  through  a  random 
channel. In 1956 Price [Ref. 5] and later Price and Green [Ref. 6],
using a unique combination of theoretical analysis and engineering
intuition, developed an adaptive receiver called RAKE which effectively
reduced the difficulty of communication through random multipath
channels  by  estimating  some  of  the  channel  properties  while  receiving 
signals.  Kailath  [Refs.  7  and  8]  derived  an  optimum  receiver  for  the 
same  problem  and  showed  that  it  exhibited  adaptive  properties.  He  also 
pointed  out  that  the  RAKE  receiver  was  closely  related  to  the  optimum 
receiver.  Proakis  and  Drouilhet  [Ref.  9]  simulated  two  types  of  binary 
communication  systems  using  decision-directed  feedback  to  learn  the 
unknown  phase  of  a  received  signal;  they  have  derived  error  probabilities 
which  verify  that  systems  of  this  nature  are  in  some  cases  superior  to 
nonadaptive systems. Scudder, in 1964 [Ref. 10], derived the optimum
learning receiver for the same communication problem; however, in the
form he derived, the receiver grows exponentially (see Chapter II). For
this  reason  Scudder  proposed  and  analyzed  a  decision-directed  learning 
scheme. 

In 1961 Glaser [Ref. 11] used a combination of decision-theoretic and
intuitive arguments to arrive at an adaptive machine to learn unknown
repetitive waveforms in a background of noise. Jakowatz, Shuey, and White
[Ref. 12] invented a machine for the same purpose in 1961. Both of these
machines learn with a teacher, using decision-directed feedback as the
teacher. Hinich [Ref. 13] performed an analysis of the Jakowatz machine
in 1962, and later [Ref. 14] modified the mathematical model to obtain a
more precise analysis.

The  work  which  is  most  closely  related  to  the  research  presented  here 
was  initiated  by  Braverman  in  1961  [Ref.  15].  Braverman  examined  the 
problem  in  which  a  previously  classified  learning  sequence  is  available 
(learning with a teacher), and established the fact that the solution
which uses all relevant observations to condition the a posteriori
probabilities achieves the minimum average risk. He also established
the  convergent  properties  of  this  solution.  This  work  was  extended  by 
Abramson  and  Braverman  [Ref.  16]  and  applied  to  the  problem  of  learning 
the  vector  mean  of  a  random  vector  which  was  normally  distributed.  In 
1963  Keehn  [Ref.  17]  solved  the  more  general  learning  problem  in  which 
the  random  vector  is  normally  distributed  with  both  unknown  mean  and 
covariance matrix. At the same time Spragins [Ref. 18] generalized the
approach  of  Abramson  and  Braverman  in  a  different  direction  to  obtain 
the  solution  to  the  general  parametric  learning  (with  a  teacher)  problem 
in  which  a  fixed-size  (nontrivial)  sufficient  statistic  exists. 

While work on the learning with a teacher problem was continuing,
Daly [Refs. 19 and 20] in 1961 used a decision-theory approach (the
"Bayes" approach used by Braverman) to attack the learning without a
teacher problem. Daly solved the one-dimensional binary detection problem
and established the convergence of the solution; however, he also
demonstrated  that  his  solution  required  a  system  which  grew  exponentially 
with  the  number  of  learning  observations.  Both  Daly  and  Scudder  turned 
their  attention  to  systems  which,  like  the  majority  of  those  proposed  to 
solve  the  learning  problem  without  a  teacher,  use  decision  feedback  as 
a  teacher  to  aid  in  the  learning  process.  A  more  complete  explanation 
of  the  exponentially  growing  system  will  be  found  in  Chapter  II. 

In this investigation we have taken the so-called Bayes approach
(explained in Chapter II) which was used by other investigators [Refs.
10, 15-20], and we have concentrated on the learning without a teacher
problem.  One  of  the  most  important  results  is  the  fact  that  in  many 
problems  this  approach  does  lead  to  systems  of  fixed  size.  In  problems 
in  which  the  system  size  must  grow,  a  change  in  the  formulation  of  the 
problem  will  result  in  fixed-size  systems.  This  change  requires  that  we 
approximate the space of the unknown parameter; however, the performance
of  the  resulting  fixed-size  system  is  in  an  engineering  sense  equivalent 
to  the  growing  system. 

D.  ORGANIZATION,  APPROACH,  AND  SIGNIFICANT  RESULTS 

In the first portion of this report (Chapters II and III), the equations
describing the learning system are derived and then applied to
signal-detection problems in which an important signal parameter is
unknown  but  fixed.  The  performance  of  such  a  system  is  discussed.  In 
the  second  part,  which  consists  of  Chapters  IV,  V,  and  VI,  the  equations 
are  applied  to  problems  in  which  the  important  unknown  parameter  is  time 
varying.  The  stability,  convergence,  and  realizability  of  the  general 
learning  system  are  discussed. 
The investigation of the learning problem is initiated by defining
a  suitably  general,  repetitive  binary  decision  problem  depending  upon  an 
unknown  parameter  which  is  fixed.  This  parameter  is  treated  as  a  random 
variable,  and  an  a  priori  probability  distribution  is  chosen  to  describe 
the  initial  state  of  knowledge  of  the  parameter.  As  more  and  more 
observations  are  received,  more  and  more  information  concerning  the 
parameter  is  obtained;  thus  the  observations  "condition"  the  probability 
distribution  of  the  parameter.  By  developing  a  recursive  expression 
which describes this unfolding sequence of conditional probability
distributions, a mathematical description of the learning process is obtained;
and by utilizing this recursive expression, a learning system is
synthesized.

This technique is applied to two examples in order to illustrate the
types of problems that are readily solved. The first example involves the
detection of a signal of known waveform but unknown amplitude embedded in
noise. It has been chosen to illustrate the technique as simply as
possible. The second example involves the detection of a narrowband signal
of unknown frequency embedded in noise. It has been chosen as an example
of a problem which frequently occurs in the electronic-countermeasures
and reconnaissance fields, which is readily solved by the proposed
technique, but which has not been attacked successfully by any other means
[Ref. 21].

The investigation is continued by extending the development to (1) the
"learning" problem in which the a priori probabilities of occurrence
of the alternative hypotheses are unknown, and (2) the repetitive
multiple-hypothesis decision problem.

In  the  third  chapter  techniques  for  the  evaluation  of  system  performance 
are  discussed  briefly.  An  example  is  presented  of  the  performance  bounds 
of  a  system  that  detects  the  presence  of  a  narrowband  signal  of  unknown 
frequency in bandlimited white Gaussian noise. Thus this latter example,
which  is  used  in  both  Chapters  II  and  III,  may  be  used  to  present  a  sort 
of  overview  of  the  major  contribution  of  this  research  to  the  reader 
familiar  with  the  so-called  Bayes  approach  to  the  decision  problem. 

Succeeding  chapters  are  extensions  of  the  solution  and  developments 
of  the  properties  of  the  resulting  system. 
In the fourth chapter the technique is extended to include problems
in which the unknown parameter is randomly varying in time. Depending
upon the model of time variations, the resulting systems are simple
modifications of the learning systems for fixed parameters. The synthesis
technique is applied to examples of communications, radar, and
electronic reconnaissance problems.

In  the  fifth  chapter  the  size  of  the  optimal  learning  system  is 
defined  in  terms  of  the  number  of  elements  required  to  construct  the 
system.  It  is  shown  that  in  many  cases  the  optimal  systems  are  of 
finite  size.  In  the  cases  where  the  optimum  systems  grow  as  the  learning 
sequence lengthens, it is shown that a suboptimum system can be constructed
from a finite number of elements. In the sixth chapter it is
shown  that  the  finite  suboptimum  system  has  a  performance  which  is  not 
measurably  different  from  the  optimum  system.  Thus  from  an  engineering 
standpoint  the  optimum  learning  system  may  always  be  realized  from  a 
finite  number  of  elements. 

Other  properties  of  learning  systems  are  presented  in  the  sixth 
chapter.  The  systems  are  stable  and  converge  in  performance  so  that  as 
the  learning  sequence  lengthens  the  performance  of  the  learning  system 
is  equivalent  to  the  performance  of  a  system  which  is  given  a  priori 
knowledge  of  the  unknown  parameter. 
II. DEVELOPMENT OF THE LEARNING SYSTEM


As  was  pointed  out  in  Chapter  I,  there  are  many  repetitive  decision 
problems  in  which  an  important  parameter  is  unknown  and  in  which  external 
aid  is  not  available  from  which  to  obtain  a  representative  set  of  properly 
classified  "learning"  observations.  Such  problems  require  systems  which 
learn  without  a  teacher,  and  it  is  the  purpose  of  this  chapter  to  develop 
and  explain  a  general  technique  for  the  synthesis  of  such  systems.  The 
technique  developed  is  based  on  the  assumption  that  the  unknown  parameters 
are  either  time  invariant,  or  so  slowly  time  varying  that  they  may  be 
treated as being fixed. The more difficult time-varying unknown-parameter
problem  is  investigated  in  Chapter  IV. 

A.  THE  LEARNING  PROBLEM  MODEL  (BINARY  DECISIONS) 

In  order  to  explain  the  techniques  involved  in  the  synthesis  of 
learning  systems,  we  first  consider  the  binary  decision  problem.  We 
shall  phrase  the  problem  in  terms  of  detection  of  a  signal  which  depends 
upon  a  set  of  unknown  parameters;  however,  the  result  will  be  easily 
generalized.

Assume that we are given an observation representable by the column
vector $X_k$ and a learning sequence $\lambda_{k-1} = \{X_1, X_2, \ldots, X_{k-1}\}$, consisting
of the first (k-1) such vectors. Each observation contains a signal
corrupted by noise, or it contains noise alone, and we desire to synthesize
a system to decide whether or not the $k$th observation $X_k$ contains
a signal. We assume that our system may make mistakes, and that each
mistake costs something which may be expressed in terms of a function
which depends on the actual situation as well as the decision. We ask
for a system which will minimize the average risk associated with each
decision; i.e., we require a system which is optimum in the Bayes sense
at every decision instant. The system which performs this minimization
computes the likelihood ratio and compares it to some threshold. To be
more precise, we let
$H_1$ = the hypothesis that $X_k = S(\theta) \oplus N$
$H_2$ = the hypothesis that $X_k = N$

where

$S(\theta)$ = the signal vector (unknown parameters)

$\theta$ = unknown signal parameters

$N$ = the noise vector

$\oplus$ = the corrupting operation (addition, multiplication, etc.)


Then, if the signal parameters were known, the optimum system would compute
the ratio of conditional probabilities, or likelihood ratio [see
Ref. 4]:

\[ \ell(X_k \mid \theta) = \frac{p(X_k \mid H_1, \theta)}{p(X_k \mid H_2)} \tag{2.1} \]


and  compare  it  to  a  threshold  depending  upon  the  relative  loss  associated 
with  the  two  types  of  errors  (false  alarm  and  miss)  and  the  a  priori 
probability  of  occurrence  of  the  two  hypotheses. 
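A minimal sketch of this threshold test follows, assuming a particular case that the text does not fix: the corrupting operation is addition, the noise samples are independent zero-mean Gaussian with known standard deviation `sigma`, and the signal vector `s` is completely known. The threshold is taken as L p2/p1, as defined in the list of symbols; the function names and the numerical values are hypothetical.

```python
import numpy as np

def likelihood_ratio(x, s, sigma=1.0):
    """l(X | theta) for X = S + N under H1 versus X = N under H2.

    Assumes white Gaussian noise; the terms common to both densities cancel.
    """
    x = np.asarray(x, dtype=float)
    s = np.asarray(s, dtype=float)
    log_l = (x @ s - 0.5 * (s @ s)) / sigma**2
    return float(np.exp(log_l))

def decide_signal_present(x, s, L, p1, p2, sigma=1.0):
    """Decide H1 when the likelihood ratio exceeds the threshold L * p2 / p1."""
    gamma = L * p2 / p1
    return likelihood_ratio(x, s, sigma) > gamma

s = np.array([1.0, 1.0, 1.0, 1.0])            # hypothetical known signal samples
noise = np.array([0.1, -0.2, 0.05, 0.0])      # one hypothetical noise realization
x_signal = s + noise                          # observation under H1
x_noise = noise                               # observation under H2

print(decide_signal_present(x_signal, s, L=1.0, p1=0.5, p2=0.5))
print(decide_signal_present(x_noise, s, L=1.0, p1=0.5, p2=0.5))
```

Raising L (false alarms more costly) or p2 (noise alone more likely a priori) raises the threshold, so a larger likelihood ratio is needed before a signal is declared.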

If the signal were random with known distribution $p(\theta)$, the optimum
system would compute an average likelihood ratio [see Ref. 4]:

\[ \ell(X_k) = \int \ell(X_k \mid \theta)\, p(\theta)\, d\theta \tag{2.2} \]


In the problem at hand, when we wish to take advantage of all prior
information, we may easily show that the optimum system computes a conditional
likelihood ratio [Ref. 15, pp. 12-16], that is, a ratio of
probabilities conditioned on the past:

\[ \ell(X_k \mid \lambda_{k-1}) = \frac{p(X_k \mid H_1, \lambda_{k-1})}{p(X_k \mid H_2, \lambda_{k-1})} \tag{2.3} \]
This may be rewritten in a more useful form as a conditional expectation:

\[ \ell(X_k \mid \lambda_{k-1}) = \int \ell(X_k \mid \theta, \lambda_{k-1})\, p(\theta \mid \lambda_{k-1})\, d\theta \tag{2.4} \]

We have assumed that $\theta$ is the only unknown parameter; hence, if we were
given $\theta$ we could learn nothing from $\lambda_{k-1}$;* thus $\ell(X_k \mid \theta, \lambda_{k-1}) = \ell(X_k \mid \theta)$.
The synthesis of a system which will compute this latter function is a
standard problem of detection theory, and solutions are usually known.
The problem of interest involves the synthesis of a system to compute
$p(\theta \mid \lambda_{k-1})$.
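The averages in Eqs. (2.2) and (2.4) differ only in the distribution of the unknown parameter that weights the likelihood ratio. The sketch below approximates either integral by a weighted sum over a finite grid of parameter values. The grid, the function names, and the scalar "unknown amplitude in unit-variance Gaussian noise" model are assumptions made for illustration.

```python
import numpy as np

def averaged_likelihood_ratio(x, thetas, weights, lr):
    """Approximate the integral of l(x | theta) w(theta) d(theta) by a weighted sum.

    thetas  : grid of candidate parameter values (an assumed discretization)
    weights : probability mass on each grid point; the a priori distribution
              gives an Eq. (2.2)-style average, the current conditional
              distribution gives an Eq. (2.4)-style average
    lr      : function lr(x, theta) returning l(x | theta)
    """
    weights = np.asarray(weights, dtype=float)
    values = np.array([lr(x, th) for th in thetas])
    return float(values @ weights / weights.sum())

def lr_unknown_amplitude(x, theta, sigma=1.0):
    """l(x | theta) for a scalar observation x = theta + noise versus x = noise."""
    return float(np.exp((x * theta - 0.5 * theta**2) / sigma**2))

thetas = np.array([0.5, 1.0, 2.0])
flat = np.array([1.0, 1.0, 1.0]) / 3.0      # nothing learned yet
peaked = np.array([0.0, 0.0, 1.0])          # parameter effectively known to be 2.0

x = 1.7
print(averaged_likelihood_ratio(x, thetas, flat, lr_unknown_amplitude))
print(averaged_likelihood_ratio(x, thetas, peaked, lr_unknown_amplitude))
```

When the weighting distribution collapses onto a single value, the average reduces to the known-parameter likelihood ratio of Eq. (2.1), which is the sense in which a learning system can approach a system with a priori knowledge of the parameter.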

B.  AN  EXPONENTIALLY  GROWING  SOLUTION 

In order to understand the difficulty which arises when we attempt to
synthesize a system to compute $p(\theta \mid \lambda_{k-1})$, we may take the following
approach (suggested by Daly [Refs. 19, 20] and Scudder [Ref. 10]). Suppose
that we knew which members of the sequence $\{X_1, X_2, \ldots, X_{k-1}\}$ contained
a signal. Then we could use these members in a machine which
learned with a teacher (Refs. 14, 15, and 16 tell us how to construct
such machines). If we lack knowledge to learn with a teacher, we may
still build $2^{k-1}$ machines which learn with a teacher, partition the
sequence into the $2^{k-1}$ possible ways in which the (k-1) observations
might be classified, and feed one partition into each machine. Each
partition will have a known probability of occurrence; thus if we weight
the output of the $2^{k-1}$ learning machines by the appropriate probabilities
of occurrence and sum, we will have solved the problem. Clearly,
the resulting system will grow exponentially as we add more learning
observations.


* This amounts to an assumption of conditional independence which may be
written

$$p(X_1, X_2, \ldots, X_k \mid \theta, H_1) = p(X_1 \mid \theta, H_1)\, p(X_2 \mid \theta, H_1) \cdots p(X_k \mid \theta, H_1)$$
To be more precise, we define a binary random vector $Z_i^{(k)}$ such that
it has k components. Each component has the value 1 or 0 depending
respectively upon whether the signal is or is not present in the
observation which that component represents. [The sequence {present, not
present, present} is represented by the vector $(1\ 0\ 1) = Z_5^{(3)}$.]
$Z_i^{(k)}$ is an ordered k-tuple with binary-valued components, and there
are $2^k$ possible $Z_i^{(k)}$. These may be ordered by letting $Z_i^{(k)}$ equal
the binary expansion of i as i varies from 0 to $2^k - 1$. By
conditioning the distribution $p(\theta \mid \Lambda_{k-1})$ on the random vector
$Z_i^{(k-1)}$ and averaging over all i, we obtain

$$p(\theta \mid \Lambda_{k-1}) = \sum_{i=0}^{2^{k-1}-1} p(\theta \mid \Lambda_{k-1}, Z_i^{(k-1)})\, p(Z_i^{(k-1)} \mid \Lambda_{k-1}) \tag{2.5}$$


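Thus (2.4) may be rewritten as

$$\ell(X_k \mid \Lambda_{k-1}) = \sum_{i=0}^{2^{k-1}-1} p(Z_i^{(k-1)} \mid \Lambda_{k-1}) \int \ell(X_k \mid \theta)\, p(\theta \mid \Lambda_{k-1}, Z_i^{(k-1)})\, d\theta \tag{2.6}$$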


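Both of the conditional distributions in (2.6) may be expanded in terms
of known functions; e.g.,

$$p(Z_i^{(k-1)} \mid \Lambda_{k-1}) = \frac{p(\Lambda_{k-1} \mid Z_i^{(k-1)})\, p(Z_i^{(k-1)})}{\sum_{j=0}^{2^{k-1}-1} p(\Lambda_{k-1} \mid Z_j^{(k-1)})\, p(Z_j^{(k-1)})} \tag{2.7a}$$

$$p(\Lambda_{k-1} \mid Z_i^{(k-1)}) = \int p(\Lambda_{k-1} \mid Z_i^{(k-1)}, \theta)\, p_0(\theta)\, d\theta \tag{2.7b}$$

$$p(\Lambda_{k-1} \mid Z_i^{(k-1)}, \theta) = \prod_{j=1}^{k-1} p(X_j \mid Z_i^{(k-1)}, \theta) \tag{2.7c}$$

$$p(X_j \mid Z_i^{(k-1)}, \theta) = \begin{cases} p(X_j \mid H_1, \theta) & \text{if the jth component of } Z_i^{(k-1)} = 1 \\ p(X_j \mid H_2) & \text{if the jth component of } Z_i^{(k-1)} = 0 \end{cases} \tag{2.7d}$$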


Thus from (2.6) a system may be synthesized. Unfortunately, the system
will grow exponentially as the learning sequence lengthens. That is,
$2^k$ computations are required for the optimum utilization of k learning
observations. As the length of the learning sequence increases, the
system grows in size very rapidly, and for this reason it does not seem
practical for large values of k. When we are interested only in small
values of k, however, this type of system may be quite practical, and
may even result in a less complex system than the one which we shall
describe in the next section.

The conclusion that optimal machines which learn without a teacher
are impractical for large k might seem to follow from the above argument.
In fact, however, this is not the case, as will be demonstrated in
the next paragraph.
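The growth described above can be made concrete with a minimal sketch. It
assumes a scalar observation model that is not part of the report (X = theta
+ N under H1, X = N under H2, with unit-variance gaussian noise) and a finite
grid of candidate theta values; the names `gauss` and
`posterior_by_enumeration` are hypothetical. One "machine" is visited for
each of the possible labelings of the learning sequence.

```python
import itertools
import math

def gauss(x, mean, sigma=1.0):
    """Normal density N(mean, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_by_enumeration(xs, thetas, prior, p1):
    """p(theta | learning sequence) by summing over every labeling Z_i.

    xs     : the unclassified learning observations (assumed scalar)
    thetas : finite grid of candidate parameter values
    prior  : p_0(theta) on that grid
    p1     : known a priori probability that a signal is present
    Returns the posterior on the grid and the number of labelings visited.
    """
    weights = [0.0] * len(thetas)
    machines = 0
    for z in itertools.product((0, 1), repeat=len(xs)):
        machines += 1
        # p(Z_i): each label is independent with P(signal present) = p1
        p_z = math.prod(p1 if zj else 1.0 - p1 for zj in z)
        for n, theta in enumerate(thetas):
            # Product of per-observation densities under labeling z, as in (2.7c), (2.7d)
            like = math.prod(gauss(x, theta if zj else 0.0) for x, zj in zip(xs, z))
            weights[n] += p_z * like * prior[n]
    total = sum(weights)
    return [w / total for w in weights], machines
```

With five learning observations the loop already visits 32 labelings, and
each further observation doubles that count.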


C.  A  RECURSIVE  SOLUTION 

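In order to obtain a system of fixed size, we return our attention
to $p(\theta \mid \Lambda_{k-1})$ and proceed as follows.

$$p(\theta \mid \Lambda_{k-1}) = p(\theta \mid X_1, X_2, \ldots, X_{k-1}) \tag{2.8a}$$

or by Bayes' law,

$$p(\theta \mid \Lambda_{k-1}) = \frac{p(X_{k-1} \mid \theta, X_1, \ldots, X_{k-2})\, p(\theta \mid X_1, \ldots, X_{k-2})}{p(X_{k-1} \mid X_1, \ldots, X_{k-2})} = \frac{p(X_{k-1} \mid \theta, \Lambda_{k-2})}{p(X_{k-1} \mid \Lambda_{k-2})}\, p(\theta \mid \Lambda_{k-2}) \tag{2.8b}$$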


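Consider the conditional density $p(X_{k-1} \mid \theta, \Lambda_{k-2})$. There are
two possibilities for $X_{k-1}$: $H_1$ may be true or $H_2$ may be true. Thus
$p(X_{k-1} \mid \theta, \Lambda_{k-2})$ may be written as a mixture:

$$p(X_{k-1} \mid \theta, \Lambda_{k-2}) = p(H_1)\, p(X_{k-1} \mid H_1, \theta, \Lambda_{k-2}) + p(H_2)\, p(X_{k-1} \mid H_2, \theta, \Lambda_{k-2}) \tag{2.9}$$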

In Eq. (2.9) we have assumed that $p(H_1)$ and $p(H_2)$ are known a priori.
In many interesting problems this is not the case. The problem where
$p(H_1)$ and $p(H_2)$ are not known a priori is treated later in this
chapter. If $H_1$ is true and $\theta$ is known, then $X_{k-1}$ does not depend
on $\Lambda_{k-2}$. The noise is independent of the signal; therefore, if $H_2$ is
true, $X_{k-1}$ does not depend on either $\theta$ or $\Lambda_{k-2}$; thus

$$p(X_{k-1} \mid \theta, \Lambda_{k-2}) = p(H_1)\, p(X_{k-1} \mid \theta, H_1) + p(X_{k-1} \mid H_2)\, p(H_2) \tag{2.10}$$

By similar reasoning we may write

$$p(X_{k-1} \mid \Lambda_{k-2}) = p(H_1)\, p(X_{k-1} \mid H_1, \Lambda_{k-2}) + p(H_2)\, p(X_{k-1} \mid H_2) \tag{2.11}$$

Finally, by factoring and rewriting (2.8b), (2.10), and (2.11) we have

$$p(\theta \mid \Lambda_{k-1}) = \frac{\ell(X_{k-1} \mid \theta) + \alpha}{\ell(X_{k-1} \mid \Lambda_{k-2}) + \alpha}\; p(\theta \mid \Lambda_{k-2}) \tag{2.12}$$

where $\alpha = p(H_2)/p(H_1)$. The importance of Eq. (2.12) lies in the fact
that it has a recursive form. This fact will allow synthesis of a system
in delay-feedback form. As discussed in Chapter V, the system may be
realized if the number of usefully distinguishable possible values of
$\theta$ is finite.
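A minimal sketch of one step of the recursion (2.12) on a finite grid of
parameter values may help fix ideas. The scalar model used for the
likelihood ratio (X = theta + N against X = N, unit-variance gaussian noise)
is an assumption made only for illustration, and the function names are
hypothetical.

```python
import math

def likelihood_ratio(x, theta):
    """l(X|theta) for the assumed scalar model: N(theta, 1) against N(0, 1)."""
    return math.exp(theta * x - 0.5 * theta * theta)

def update_posterior(post_prev, x_prev, thetas, alpha):
    """One step of Eq. (2.12) on a finite grid of theta values.

    post_prev : p(theta | past up to k-2) on the grid
    x_prev    : the newly received learning observation X_{k-1}
    alpha     : p(H2) / p(H1), assumed known
    """
    lr = [likelihood_ratio(x_prev, th) for th in thetas]
    # Conditional likelihood ratio of X_{k-1} given the past: grid form of (2.4)
    lr_cond = sum(l * p for l, p in zip(lr, post_prev))
    return [(l + alpha) / (lr_cond + alpha) * p for l, p in zip(lr, post_prev)]

thetas = [0.5, 1.0, 2.0, 3.0]
post = [0.25, 0.25, 0.25, 0.25]            # initial distribution, uniform
for x in [2.1, -0.3, 1.8, 0.2, 2.4]:       # unclassified observations
    post = update_posterior(post, x, thetas, alpha=1.0)
```

Only the current posterior vector is stored between observations, so the
memory required does not grow with the length of the learning sequence. The
update also preserves normalization, since the weighted sum of the
numerators equals the denominator.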

If  we  review  the  computations  required  of  this  system  which  learns 
without  a  teacher,  we  find  that  we  are  in  a  position  to  synthesize  the 
system.  The  computations  required  are: 

1. Compute $\ell(X_k \mid \theta)$ for each possible $\theta$

2. Compute $p(\theta \mid \Lambda_{k-1})$ for each possible $\theta$

3. Weight (1) by (2) and sum over all $\theta$.

The third computation will result in $\ell(X_k \mid \Lambda_{k-1})$. We have assumed
that we know how to compute $\ell(X_k \mid \theta)$. Suppose that somehow we could
obtain $\ell(X_{k-1} \mid \Lambda_{k-2})$ and $p(\theta \mid \Lambda_{k-2})$; then the system of
Fig. 1 would provide the desired $p(\theta \mid \Lambda_{k-1})$.


FIG. 1. A SYSTEM TO COMPUTE $p(\theta \mid \Lambda_{k-1})$.

In Fig. 1 and throughout this report, the divider symbol has been
used to indicate a zero-memory device which has as an output the ratio
of the two inputs. The input marked "n" is the numerator, that marked
"d" is the denominator. The multipliers and adders
are also zero-memory devices.

If we simply store $p(\theta \mid \Lambda_{k-1})$, we will have it available for
computation of $p(\theta \mid \Lambda_k)$ when the next vector $X_{k+1}$ is received.
Similarly, if we store $\ell(X_k \mid \theta)$, it will be available when $X_{k+1}$ is
received. The system shown in Fig. 2 is one form of the required learning
system.


FIG. 2. A BINARY DETECTOR WHICH LEARNS WITHOUT A TEACHER
(SEQUENTIAL FORM).


There are several facts concerning this system which, although
self-evident, should be considered. First, the computation of
$\ell(X_k \mid \theta)$ and $p(\theta \mid \Lambda_{k-1})$ must be made for every possible
value of $\theta$. The integrator must be synchronized with the sequential
variation of $\theta$. Second, when the machine is started an initial
distribution of $\theta$, or $p_0(\theta)$, must be inserted. This distribution may
be uniform over $\theta$, or it may have any convenient form consistent with
our a priori knowledge of $\theta$.

The fact that the computation of $\ell(X_k \mid \theta)$ and
$p(\theta \mid \Lambda_{k-1})$ must be made for every possible value of $\theta$ poses
a difficult problem. If $\theta$ varies in a continuous space, there will be
an uncountable infinity of possible values, and the various components of
Fig. 2 will not be realizable exactly. We shall circumvent this problem by
assuming that the space of $\theta$ can be quantized, so that the system need
compute $\ell(X \mid \theta)$ for only a finite number of values of $\theta$. Later,
in Chapter V, we shall demonstrate that a quantized space may always be
chosen so that the performance of a system based on this space will be
arbitrarily close to the performance of the theoretical system.

The assumption that the space of $\theta$ has a finite number of points
allows us to represent the system in an alternative form as illustrated
in Fig. 3. In this form the system computes $\ell(X_k \mid \theta_i)$ and
$p(\theta_i \mid \Lambda_{k-1})$ and takes the product simultaneously for all values
of $\theta$. The products thus formed are summed, and the result is
$\ell(X_k \mid \Lambda_{k-1})$. Both the sequential and the parallel forms of the
system will be used in the various examples in this and following chapters.


FIG. 3. A BINARY DETECTOR WHICH LEARNS WITHOUT A TEACHER
(PARALLEL FORM).

D.  EXAMPLES 

The system which has just been derived may be applied directly to a
wide variety of signal-detection problems. This application requires only
that we determine an explicit expression for $\ell(X \mid \theta)$ and synthesize a
system to compute the expression. The following examples demonstrate this
procedure.
1. Detection of a Signal of Unknown Amplitude

In this example we shall consider the problem of detecting a signal
of known waveform but unknown amplitude embedded in additive noise. Such a
problem might arise if we were to use for a communication channel a medium
which faded so slowly that the attenuation could be considered constant.
(For a more realistic consideration of the fading-channel communication
problem, and an application of this example, see example 1 of Chapter IV.)

We assume that the signal to be detected may be written as the product
of an unknown scalar and a known bandlimited waveform of duration T.

s(t) = c b(t)

where b(t) = known waveform of bandwidth W, duration T
      c = unknown scalar

The signal is embedded in a background of additive, gaussian noise of zero
mean and covariance matrix K.

In our hypothetical problem we are given a received waveform x(t)
(perhaps an i-f amplifier voltage) which starts at time zero and continues
to the present. For simplicity we assume that the signal is of duration
T and may only start at instants separated from a known synchronization
instant by integral multiples of T. The signal is transmitted at intervals
chosen at random, and our problem is to look at the received waveform
for a duration T $\{x(t);\ (k-1)T \le t \le kT\}$ and decide whether or not the
signal is present.

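In order to easily manipulate the appropriate variables we shall
take advantage of the representation of continuous bandlimited waveforms
by vectors of the sample values of the waveform. We shall denote

$$S = \begin{bmatrix} s(0) \\ s(\Delta) \\ \vdots \\ s(T - \Delta) \end{bmatrix}, \qquad B = \begin{bmatrix} b(0) \\ b(\Delta) \\ \vdots \\ b(T - \Delta) \end{bmatrix}, \qquad N = \begin{bmatrix} n(0) \\ n(\Delta) \\ \vdots \\ n(T - \Delta) \end{bmatrix}$$

where n(t) is the noise waveform and $\Delta$ is the sampling interval.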
We divide the received waveform into "observations" of duration
T, and denote these by the indexed vectors $X_k$, where $X_k$ holds the
samples of x(t) over $(k-1)T \le t \le kT$.

We define two hypotheses which apply to each observation:

$H_1$ = the hypothesis that $X_k = cB + N$

$H_2$ = the hypothesis that $X_k = N$

The optimum learning system must compute the likelihood ratio, $\ell(X \mid c)$,
which is given by

$$\ell(X \mid c) = \frac{p(X \mid c, H_1)}{p(X \mid c, H_2)} = \exp\left(-\frac{1}{2} c^2 B^t K^{-1} B + c X^t K^{-1} B\right)$$

In order to vary the computation over all c, we restrict c to some
range, say $r_1 \le c \le r_2$. To easily construct the system we make c a
function of time, and integrate over time instead of c; that is, we
sweep c linearly from $r_1$ to $r_2$. If we make the sweep period T
the same as the observation interval, synchronization will be much more
easily achieved. The resulting system is shown in Fig. 4.

In this system the input vector X is transformed by the matrix
operator $K^{-1/2}$ to yield the vector $K^{-1/2} X$. This vector is multiplied,
term by term, by the vector $K^{-1/2} B$ and the terms summed (accumulated
for T sec) to provide $X^t K^{-1} B$. This product is a scalar and is the
value of the accumulator sampled at the appropriate instant. This sample
is held for T sec while the gains c are swept through the range.
During this period the contents of the accumulator are dumped and the
product involving the next observation is accumulated. Thus once every
T sec the parameter c is swept through its range ($r_1$ to $r_2$) and
the output $\ell(X \mid c)$ is swept through the range of c.

FIG. 4. A LEARNING SYSTEM FOR DETECTION OF SIGNALS OF
UNKNOWN AMPLITUDE. (a) The sweeping likelihood computer.
(b) The detection system.
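The swept computation can be imitated numerically. The following sketch
evaluates the likelihood ratio of this example on a linear sweep of c; it
assumes white noise, so that K is a scalar multiple of the identity, which
is a simplification not made in the text. The function name and its
arguments are hypothetical.

```python
import math

def sweep_likelihood(x, b, sigma2, r1, r2, steps):
    """Evaluate l(X|c) = exp(-c^2 B'K^-1 B / 2 + c X'K^-1 B) on a linear sweep of c.

    Assumes white noise, K = sigma2 * I, so K^-1 reduces to the scalar 1 / sigma2.
    x, b   : samples of the received waveform and of the known waveform b(t)
    r1, r2 : end points of the sweep range for c
    steps  : number of sweep points (at least 2)
    """
    bkb = sum(bi * bi for bi in b) / sigma2               # B^t K^-1 B, fixed by the waveform
    xkb = sum(xi * bi for xi, bi in zip(x, b)) / sigma2   # X^t K^-1 B, the held accumulator sample
    cs = [r1 + (r2 - r1) * n / (steps - 1) for n in range(steps)]
    return cs, [math.exp(-0.5 * c * c * bkb + c * xkb) for c in cs]
```

The accumulator sample is computed once per observation and then reused for
every value of c, which mirrors the sample-and-hold arrangement described
above.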

2.  Detection  of  a  Narrowband  Signal  of  Unknown  Frequency 

A problem which arises often in the different phases of the
electronic-countermeasures, reconnaissance, and communications fields is the
detection of a narrowband signal of unknown frequency f, random amplitude, and
random phase. The problem of detecting such a signal when only the
current observation is used has been discussed by Helstrom [Ref. 22]
and Wainstein and Zubakov [Ref. 23] among others. These authors suggest
the use of a receiver which uses a bank of narrowband filters centered
at each possible frequency. The filter output which is maximum is
compared to a threshold to determine whether a signal is present or not.

The performance of such a receiver (called a "maximum likelihood"
receiver by Helstrom) is evaluated by Wainstein and Zubakov. Such a
receiver is shown to have performance which is nearly as good as the
performance of the Bayes or average-likelihood receiver [Ref. 23] without
learning. In many cases, however, this performance is not adequate (see
Fig. 10 of Chapter III) and it is desirable to take advantage of the fact
that the signal is recurring at the same frequency; that is, it is
desirable to apply the techniques of learning to the problem.

If we did not desire to use more than k past observations to
learn f, we could use $2^k$ receivers constructed to learn the frequency
with a teacher as explained in Sec. B. (See also Refs. 19, 20, and 24.)
However, for even moderate k such a receiver would be impractical.

In the following paragraphs we shall derive the optimum learning
receiver for this problem. We shall see that it consists essentially of
a bank of periodogram calculators (which are approximately narrowband
filters) whose outputs are the inputs to a bank of antilog devices. The
antilog device outputs are weighted by the learned probability distribution
of frequency and summed. The sum is the desired likelihood ratio.
Mathematically, we proceed as follows.

Assume that the signal s(t) may be represented by a sample
function of a narrowband gaussian random process over the interval T
when the signal is present. The sample functions are independent from
one interval to the next, and the occurrence of a signal in one interval
is independent of its occurrence in other intervals. We also assume
that the signal can start only at times which are separated from a known
synchronization instant by integral multiples of T. Thus over any
interval T the signal may be described by Eq. (2.13).

For other discussions of this problem see Refs. 21 and 24.

$$s(t) = a \cos(\omega t + \phi) \tag{2.13}$$


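where a is a random variable with Rayleigh distribution:

$$p(a) = \begin{cases} \dfrac{a}{A^2} \exp\left(-\dfrac{a^2}{2A^2}\right) & a \ge 0 \\ 0 & a < 0 \end{cases} \tag{2.14}$$

$\phi$ is a uniformly distributed random variable:

$$p(\phi) = \begin{cases} \dfrac{1}{2\pi} & 0 \le \phi \le 2\pi \\ 0 & \text{elsewhere} \end{cases} \tag{2.15}$$

$f = \omega / 2\pi$ is unknown, except that it must be one of a discrete
set of frequencies $\{f_1, f_2, \ldots, f_Q\}$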

The assumption of synchronization may be relaxed, and the
synchronization time treated as an unknown parameter. The problem becomes
much more complex, and would not serve as a good illustration at this
point. An alternative technique when synchronization is unknown is to
choose the interval T to be very short compared to the signal duration,
and to take into account the resulting signal dependence from interval
to interval. This latter approach may be accomplished by treating the
probability $p(H_1)$ as a time-varying random parameter, thus combining
the techniques of this and the next chapter.

See Refs. 25, 26 for a description of the properties of narrowband
gaussian random processes. Equation (2.13) merely expresses the fact
that such a process may be described in terms of two independent
random variables, the amplitude and the phase.
The noise is assumed to be additive and normally distributed
with covariance matrix K.

Because the problem is to observe the received waveform over
intervals of duration T and to make repetitive decisions at the end
of each interval, we define two hypotheses which apply to each
observation:

$H_1$ = the hypothesis that x(t) = s(t) + n(t)

$H_2$ = the hypothesis that x(t) = n(t)

where x(t) = received waveform
      n(t) = noise

The optimum learning system must compute the conditional likelihood
ratio:

$$\ell(X_k \mid \Lambda_{k-1}) = \sum_{i=1}^{Q} \ell(X_k \mid f_i)\, P(f_i \mid \Lambda_{k-1}) \tag{2.16}$$

where $X_k$ is the kth 2TW-dimensional column vector of samples of x(t)
sampled at the interval $\Delta = 1/(2W)$, and W is the bandwidth within
which the frequency must fall.

To synthesize a system to solve this problem, we must express
$\ell(X_k \mid f_i)$ and $P(f_i \mid \Lambda_{k-1})$ explicitly. This may be done as follows.
First we express $\ell(X_k \mid a, \phi, f_i)$ explicitly and then average over
a and $\phi$. To this end we may use p(a) and $p(\phi)$ as defined in Eqs.
(2.14) and (2.15) since the current (kth) values of a and $\phi$ are
independent of each other and of previous and future values of a and
$\phi$. Thus

$$\ell(X_k \mid f_i) = \int_0^{\infty} p(a) \int_0^{2\pi} p(\phi)\, \ell(X_k \mid a, \phi, f_i)\, d\phi\, da \tag{2.17}$$
Because of the normality of the noise, this integral may be carried out
to yield

$$\ell(X_k \mid f_i) = \frac{C_i}{A^2} \exp\left[\frac{C_i}{2} \left| X_k^t K^{-1} E(f_i) \right|^2\right] \tag{2.18}$$

where

$$C_i = \left(\frac{1}{A^2} + R_i\right)^{-1}, \qquad R_i = J^t(f_i)\, K^{-1} J(f_i)$$

$$E(f_i) = \begin{bmatrix} \exp(j 2\pi f_i \Delta) \\ \vdots \\ \exp(j 2\pi f_i T) \end{bmatrix}, \qquad J(f_i) = \begin{bmatrix} \cos(2\pi f_i \Delta) \\ \vdots \\ \cos(2\pi f_i T) \end{bmatrix}$$

If the noise is stationary, $K^{-1} E(f_i)$ represents the sampled-data
form of the output of a linear filter with system function the
reciprocal of the noise spectral density when $\exp(j\omega_i t)$ is the input.
Thus the effect of assuming that the noise is not white may be taken
into account by the introduction of a multiplicative constant (depending
on $f_i$) in the exponent. Let $S_N(f)$ = noise spectral density; then

$$\ell(X_k \mid f_i) = \frac{C_i}{A^2} \exp\left[\frac{C_i D_i^2}{2} \left| X_k^t E(f_i) \right|^2\right] \tag{2.19}$$

where $D_i = 1/S_N(f_i)$. The quantity $|X_k^t E(f_i)|^2$ may be recognized as
being proportional to the sampled-data form of the periodogram of x(t),
since

$$\left| X_k^t E(f_i) \right|^2 = \left| \sum_{m=1}^{2TW} x(m\Delta) \exp(j\omega_i m\Delta) \right|^2 = 4W^2 \left| \int_0^T x(t) \exp(j\omega_i t)\, dt \right|^2 \tag{2.20}$$

and the periodogram (at $f = f_i$) is defined as

$$\mathrm{Per}_{f_i}[x(t)] = \frac{1}{T} \left| \int_0^T x(t) \exp(j\omega_i t)\, dt \right|^2 \tag{2.21}$$


Hence the system is required to compute the periodogram of x(t) at
each of the frequencies $\{f_i;\ i = 1, 2, \ldots, Q\}$, to weight each of these
computations by $C_i D_i^2/2$, to take the antilog of the result, and to
weight the antilog by $C_i/A^2$. This operation may be performed sequentially
by a single circuit or in parallel by a bank of Q circuits. (If it is
performed sequentially and if the frequencies are taken to be a set within
the band W separated by 1/2T cps, the circuit required may be identified
as a form of time-compressive sweeping receiver which sweeps the
band W in time T with resolution 1/T [Refs. 23, 24]. The construction
of such a receiver is quite possible; however, it may be somewhat
confusing to introduce the concept at this point and therefore we shall
utilize the parallel form of receiver.)

Since we have found the form of $\ell(X \mid f_i)$, the problem is solved
in the parallel form by inserting the $\ell(X \mid f_i)$ computer, defined by
Eq. (2.18) and illustrated in Fig. 5, into the appropriate box in Fig. 3
(identifying $\theta_i$ of Fig. 3 with $f_i$ of Fig. 5). The result is the
system of Fig. 6.
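The parallel receiver can be sketched in a few lines. The sampled
periodogram statistic follows the sum in Eq. (2.20), and the weighted sum
follows Eq. (2.16). The per-frequency gains `exp_gain` and `out_gain` are
hypothetical stand-ins for the constants that multiply the periodogram
inside and outside the antilog; they are passed in rather than derived
here, and all function names are illustrative only.

```python
import cmath
import math

def sampled_periodogram_stat(x, f, delta):
    """|X^t E(f)|^2 = |sum_m x(m*delta) exp(j 2 pi f m delta)|^2, m = 1..len(x)."""
    acc = sum(xm * cmath.exp(2j * math.pi * f * m * delta)
              for m, xm in enumerate(x, start=1))
    return abs(acc) ** 2

def conditional_likelihood(x, freqs, delta, exp_gain, out_gain, p_freq):
    """Weighted sum over the frequency bank, in the form of Eq. (2.16).

    exp_gain[i] : assumed weight applied to the statistic before the antilog
    out_gain[i] : assumed weight applied to the antilog output
    p_freq[i]   : learned probability of frequency freqs[i] given the past
    """
    total = 0.0
    for f, g_exp, g_out, p in zip(freqs, exp_gain, out_gain, p_freq):
        total += p * g_out * math.exp(g_exp * sampled_periodogram_stat(x, f, delta))
    return total
```

As the learned distribution concentrates on the true frequency, the sum is
dominated by the branch tuned to that frequency.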


E.  LEARNING  THE  A  PRIORI  PROBABILITIES 

In the model originally proposed it was assumed that the a priori
probabilities $p(H_1)$ and $p(H_2)$ were known but that some other parameter
was unknown. In many problems only the a priori probabilities are
unknown, and in other problems both the a priori probabilities and
other parameters are unknown.
FIG. 5. A COMPUTER FOR $\ell(X \mid f_i)$.

FIG. 6. A LEARNING RECEIVER.
The solution to such problems cannot be obtained by treating $p(H_1)$
and $p(H_2)$ as signal parameters because they do not appear in the
equations in the same manner. Since it is the purpose of this section to
outline the proper solution, we shall begin by assuming that $p(H_1) = p_1$
and $p(H_2) = p_2$ are the only unknown parameters. If they were known,
the optimum system could be realized by computing $\ell(X_k)$ and comparing
it to a threshold $L p_2 / p_1$ as previously noted. However, another optimum
system is one which computes $(p_1/p_2)\,\ell(X_k)$ and compares it to L. By
utilizing this latter system we may show that when $p(H_1)$ and $p(H_2)$
are unknown but a sequence of learning observations $\Lambda_{k-1}$ is available,
the optimum system computes the conditional expectation of $(p_1/p_2)\,\ell(X_k)$,
defined in (2.22), and compares it to the threshold L [Ref. 14]. The
conditional expectation is

$$L(X_k \mid \Lambda_{k-1}) = \int \ell(X_k)\, \frac{p_1}{1 - p_1}\, p(p_1 \mid \Lambda_{k-1})\, dp_1 \tag{2.22}$$


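where we have taken advantage of the fact that $p_2 = 1 - p_1$. Now
following the procedure which led to Eq. (2.12) we have by Bayes' law

$$p(p_1 \mid \Lambda_{k-1}) = \frac{p(X_{k-1} \mid p_1, \Lambda_{k-2})\, p(p_1 \mid \Lambda_{k-2})}{\int p(X_{k-1} \mid p_1, \Lambda_{k-2})\, p(p_1 \mid \Lambda_{k-2})\, dp_1} \tag{2.23}$$

Since $p_1$ is the only unknown variable, we may write

$$p(X_{k-1} \mid p_1, \Lambda_{k-2}) = p(X_{k-1} \mid H_1, \Lambda_{k-2}, p_1)\, p(H_1 \mid p_1, \Lambda_{k-2}) + p(X_{k-1} \mid H_2, \Lambda_{k-2}, p_1)\, p(H_2 \mid p_1, \Lambda_{k-2}) \tag{2.24}$$

But when we know that $X_{k-1}$ comes from the class of observations which
contain signal, we know the probability density function of $X_{k-1}$; hence

$$p(X_{k-1} \mid H_1, \Lambda_{k-2}, p_1) = p(X_{k-1} \mid H_1) \tag{2.25a}$$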
Similarly,

$$p(X_{k-1} \mid H_2, \Lambda_{k-2}, p_1) = p(X_{k-1} \mid H_2) \tag{2.25b}$$

that is, the observations are conditionally independent of the past and
of the value of $p_1$ when either $H_1$ or $H_2$ is given. When the value
of $p_1$ is known, $p(H_1)$ and $p(H_2)$ are known, so that

$$p(H_1 \mid p_1, \Lambda_{k-2}) = p_1 \tag{2.26a}$$

$$p(H_2 \mid p_1, \Lambda_{k-2}) = 1 - p_1 \tag{2.26b}$$

From Eqs. (2.24), (2.25a), and (2.25b) we may write $p(p_1 \mid \Lambda_{k-1})$ in
terms of the likelihood ratios as follows:

$$p(p_1 \mid \Lambda_{k-1}) = p(p_1 \mid \Lambda_{k-2})\, \frac{\ell(X_{k-1})\, p_1 + (1 - p_1)}{\int \left[\ell(X_{k-1})\, p_1 + (1 - p_1)\right] p(p_1 \mid \Lambda_{k-2})\, dp_1} \tag{2.27}$$


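Thus (2.22) may be rewritten in the form

$$L(X_k \mid \Lambda_{k-1}) = \frac{\int \ell(X_k)\, \dfrac{p_1}{1 - p_1} \left[\ell(X_{k-1})\, p_1 + (1 - p_1)\right] p(p_1 \mid \Lambda_{k-2})\, dp_1}{\int \left[\ell(X_{k-1})\, p_1 + (1 - p_1)\right] p(p_1 \mid \Lambda_{k-2})\, dp_1} \tag{2.28}$$

and a system may be synthesized in the form of Fig. 7.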
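A minimal sketch of this procedure on a finite grid of candidate values of
the a priori probability follows. It implements the reweighting of Eq.
(2.27) and the conditional expectation of Eq. (2.22); the quantization of
the probability to a grid, and the function names, are assumptions made for
illustration. The grid must exclude the value 1, where the ratio of the
two probabilities is undefined.

```python
def update_prior_belief(belief, p1_grid, lr_prev):
    """Grid form of Eq. (2.27).

    belief  : p(p1 | past up to k-2) on the grid
    lr_prev : likelihood ratio l(X_{k-1}) of the newly received observation
    """
    w = [(lr_prev * p1 + (1.0 - p1)) * b for p1, b in zip(p1_grid, belief)]
    total = sum(w)
    return [wi / total for wi in w]

def decision_statistic(belief, p1_grid, lr_now):
    """Grid form of Eq. (2.22): expectation of (p1 / (1 - p1)) l(X_k), compared with L."""
    return lr_now * sum(p1 / (1.0 - p1) * b for p1, b in zip(p1_grid, belief))

p1_grid = [0.1, 0.3, 0.5, 0.7, 0.9]
belief = [0.2] * 5                       # uniform initial belief about p1
for lr in [4.0, 0.3, 6.0]:               # likelihood ratios of past observations
    belief = update_prior_belief(belief, p1_grid, lr)
stat = decision_statistic(belief, p1_grid, lr_now=2.0)
```

An observation with likelihood ratio equal to one leaves the belief
unchanged, while a large ratio shifts probability toward larger values of
the a priori probability of signal.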

The solution when other parameters are also unknown is very similar
since in this case we have the basic equation:

$$L(X_k \mid \Lambda_{k-1}) = \iint \ell(X_k \mid \theta)\, \frac{p_1}{1 - p_1}\, p(\theta, p_1 \mid \Lambda_{k-1})\, dp_1\, d\theta \tag{2.29}$$

But $p(\theta, p_1 \mid \Lambda_{k-1})$ may be written as

$$p(\theta, p_1 \mid \Lambda_{k-1}) = p(\theta \mid p_1, \Lambda_{k-1})\, p(p_1 \mid \Lambda_{k-1}) \tag{2.30}$$

and its computation performed by the systems of Fig. 3. Therefore by
writing

$$L(X_k \mid p_1, \Lambda_{k-1}) = \int \ell(X_k \mid \theta)\, p(\theta \mid p_1, \Lambda_{k-1})\, d\theta \tag{2.31}$$

(which is computed by the system of Fig. 3 with a suitable choice of $\alpha$),
Eq. (2.29) becomes

$$L(X_k \mid \Lambda_{k-1}) = \int L(X_k \mid p_1, \Lambda_{k-1})\, \frac{p_1}{1 - p_1}\, p(p_1 \mid \Lambda_{k-1})\, dp_1 \tag{2.32}$$

which is functionally very similar to (2.22). Thus the system of Fig. 7
with $\ell(X_k)$ replaced by $L(X_k \mid p_1, \Lambda_{k-1})$ is the required system.

FIG. 7. A SYSTEM DESIGNED TO LEARN THE A PRIORI
PROBABILITIES.
F. EXTENSION TO THE MULTIPLE-HYPOTHESIS DECISION PROBLEM

In the multiple-hypothesis problem we are given M possible classes
into which we must categorize the vector X. There are (M-1) possible
errors associated with each of the M classes. The Bayes optimum solution
depends upon the $M^2$ different weights which may be assigned to
each error; that is, a general solution requires the comparison of weighted
a posteriori probabilities

$$p(X \mid H_i) \qquad i = 1, 2, \ldots, M$$

where

$H_i$ = hypothesis that X is in class i

A general form of the optimum system is shown in Fig. 8. From this solution
it can be seen that in multiple-hypothesis testing the conditional
probability $p(X \mid H_i)$ plays the same role as the likelihood ratio plays
in the binary detection problem. In order to obtain a learning solution
we assume that each class is characterized by an unknown vector $A_i$,
and apply the same reasoning that was applied to the signal-detection
problem. (We assume that the $A_i$ are independent, that the $X_k$ are
conditionally independent, and that the $p(H_i)$ are known.) That is,

$$p(X_1, \ldots, X_k \mid A_i, H_i) = p(X_1 \mid A_i, H_i)\, p(X_2 \mid A_i, H_i) \cdots p(X_k \mid A_i, H_i)$$

$$p(A_1, \ldots, A_M) = p(A_1)\, p(A_2) \cdots p(A_M)$$

FIG. 8. A MULTIPLE-HYPOTHESIS MACHINE.

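The solution is given by

$$p(X_k \mid H_i, \Lambda_{k-1}) = \int p(X_k \mid H_i, A_i) \left[ \frac{p(X_{k-1} \mid H_i, A_i)\, p(H_i) + \sum_{j \ne i} p(X_{k-1} \mid H_j, \Lambda_{k-2})\, p(H_j)}{\sum_{j=1}^{M} p(X_{k-1} \mid H_j, \Lambda_{k-2})\, p(H_j)} \right] p(A_i \mid \Lambda_{k-2})\, dA_i \tag{2.33}$$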

This equation shows the same recursive form shown by Eq. (2.12), and
leads to a similar system as depicted in Fig. 9. Since this system
computes only one of the required M conditional probabilities, there must
be (M-1) more systems that are identical except for initial probability
distribution $p_0(A_i)$ and probability of occurrence $p(H_i)$. These will
in general be different; but in the case where the $p(H_i)$ are all the
same, the $p_0(A_i)$ must be different, or all computer branches will
"learn" the same thing, and the system as a whole will learn nothing.

It  is  interesting  to  note  that  Eq.  (2.33)  verifies  our  intuitive 
feeling  that  if  we  do  not  have  some  initial  knowledge  that  the  patterns 
to  be  recognized  are  somehow  different,  we  are  not  able  to  learn  without 
some  external  aid. 
FIG. 9. A MULTIPLE-HYPOTHESIS MACHINE WHICH LEARNS WITHOUT A TEACHER.


G.  SUMMARY  OF  CHAPTER  II 

In this chapter we have mathematically described a class of decision
problems in which the pertinent probability measures are known except for
some set of fixed parameters. We have developed a class of systems which
will solve such problems when a sequence of "learning" observations is
available that contains information about the unknown parameter. This
class of systems may take the form of either the "sequential" or the
"parallel" canonical systems of Figs. 2 or 3. The resulting systems are
optimum at each decision instant in the sense that of all possible systems
based on the same a priori information and utilizing the same set
of observations, these systems will provide the minimum average risk
decision. The systems are also fixed in size for arbitrary learning
sequences. The optimality and fixed size represent important advantages
over both conventional and prior learning systems.
III. PERFORMANCE OF LEARNING SYSTEMS


It  is  the  purpose  of  this  chapter  to  investigate  techniques  for 
determining  bounds  on  the  performance  of  the  previously  developed  learning 
system  in  specific  cases,  and  to  gain  some  insight  into  the  way  in  which 
this  performance  depends  upon  the  number  of  samples  in  the  "learning"  set. 

A.  PERFORMANCE  MEASURES 

This chapter is concerned with two aspects of system performance.
The first is the average risk associated with a decision; the second is
the rate at which the system converges to the optimum system given
a priori knowledge of the parameter.

The average risk for the binary decision problem [Ref. 3] may be
defined as

$$\rho = p_1 P_I + L\, p_2 P_{II} \tag{3.1}$$

where $p_1$ = a priori probability of hypothesis 1 being true
      $p_2 = 1 - p_1$ = a priori probability of hypothesis 2 being true
      $P_I$ = probability of deciding that $H_2$ is true when $H_1$ is
              actually true
      $P_{II}$ = probability of deciding that $H_1$ is true when $H_2$ is
              actually true
      L = cost of a type II error relative to a type I error

It will be convenient to discuss performance in terms of the
signal-detection problem. In this case $P_I$ becomes the probability of a miss
$P_M$ and $P_{II}$ becomes the probability of false alarm $P_{FA}$ ($H_1$ = signal
present, $H_2$ = signal absent).
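As a worked illustration of Eq. (3.1), the sketch below evaluates the
average risk of the Bayes test when the parameter is known, for an assumed
scalar model that does not appear in the report: X = theta + N under H1 and
X = N under H2, with unit-variance gaussian noise and theta > 0. The
function names are hypothetical.

```python
import math

def q_function(x):
    """Upper tail probability of a standard normal variable."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def known_theta_risk(theta, p1, cost_ratio):
    """Average risk (3.1) of the likelihood-ratio test with theta known.

    Assumes theta > 0, so that comparing exp(theta*x - theta^2/2) with the
    threshold L*p2/p1 is equivalent to comparing x with a fixed level x0.
    """
    p2 = 1.0 - p1
    gamma = cost_ratio * p2 / p1                   # threshold L p2 / p1
    x0 = math.log(gamma) / theta + 0.5 * theta     # equivalent level on x
    p_fa = q_function(x0)                          # decide H1 when H2 is true
    p_miss = 1.0 - q_function(x0 - theta)          # decide H2 when H1 is true
    return p1 * p_miss + cost_ratio * p2 * p_fa
```

This quantity is the steady-state risk against which the transient risk of
a learning system may be compared.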

The rate of system convergence may be measured in terms of the decrease of the difference between transient average risk and steady-state average risk as a function of the number of observations (k). This measure will be called the "risk error" and defined as

$$\epsilon_\rho = \rho(d^*(\Lambda_k)) - \rho(d^*(\theta)) \tag{3.2}$$

where $\rho(d^*(\Lambda_k))$ = average risk of the system in the transient state after k learning observations
      $\rho(d^*(\theta))$ = average risk of the system given a priori knowledge of $\theta$

B.  A  TECHNIQUE  FOR  BOUNDING  THE  PERFORMANCE 

Although there are several possible approaches to the evaluation of performance, only one will be presented. This approach is applicable to the class of learning problems which may be expressed in terms of the detection of one of a finite set of signals embedded in noise. If the noise is additive, white, and normally distributed and if the signals are orthogonal, this approach yields some remarkably simple results which are in good agreement with intuition. For more general problems, the results are so dependent upon the particular problem that no useful or enlightening information has been uncovered.

Assume we have synthesized a system to detect the presence of a signal of unknown waveform which must be drawn from a set of m signals $\{S_1, S_2, \ldots, S_m\}$ depending on a discrete set of parameters $\{\theta_1, \theta_2, \ldots, \theta_m\}$; i.e., $S_i = S(\theta_i)$. Let the signals be embedded in noise. Then the optimum learning system will compute

$$\ell(X|\Lambda_k) = \sum_{i=1}^{m} P(\theta_i|\Lambda_k)\, \ell(X|\theta_i) \tag{3.3}$$

and compare it to a threshold

$$\gamma = L p_2 / p_1$$
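The computation of Eq. (3.3) and the comparison with the threshold may be sketched as follows. This is a minimal sketch, assuming that the a posteriori probabilities and the conditional likelihood ratios are already available as lists; the function names and the numbers are illustrative assumptions and are not part of the system description.

```python
def mixture_likelihood_ratio(posteriors, likelihood_ratios):
    """Eq. (3.3): weight each l(X|theta_i) by P(theta_i|Lambda_k) and sum."""
    if len(posteriors) != len(likelihood_ratios):
        raise ValueError("one likelihood ratio is needed per parameter value")
    return sum(p * l for p, l in zip(posteriors, likelihood_ratios))


def decide_signal_present(posteriors, likelihood_ratios, p1, cost_ratio=1.0):
    """Compare the mixture likelihood ratio with gamma = L * p2 / p1."""
    gamma = cost_ratio * (1.0 - p1) / p1
    return mixture_likelihood_ratio(posteriors, likelihood_ratios) > gamma


if __name__ == "__main__":
    # Two candidate waveforms, equally probable after learning (hypothetical).
    print(decide_signal_present([0.5, 0.5], [4.0, 0.0], p1=0.5))
```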


Let

      $\hat\theta$ = true value of $\theta$
      $d^*(\Lambda_k)$ = the Bayes decision rule based on $\Lambda_k$
      $d^*(\hat\theta)$ = the Bayes decision rule given knowledge of $\hat\theta$
      $d'(\Lambda_k)$ = any non-Bayes decision rule based on $\Lambda_k$
      $\rho(d)$ = average risk associated with decision rule d

The average risk after k observations can be no less than the risk given that $\hat\theta$ is known:

$$\rho(d^*(\Lambda_k)) \ge \rho(d^*(\hat\theta)) \tag{3.4}$$

The risk can be no greater than the risk of any other system based on $\Lambda_k$:

$$\rho(d^*(\Lambda_k)) \le \rho(d'(\Lambda_k)) \tag{3.5}$$

These  properties  follow  from  the  Bayes  nature  of  the  decision  rule.  The 
optimum  system  is  sketched  for  convenience  in  Fig.  10. 


FIG.  10.  AN  OPTIMAL  LEARNING  SYSTEM. 


To evaluate a bound, consider a suboptimum system which computes $P(\theta_i|\Lambda_k)$ and $\ell(X|\theta_i)$. Let this system determine the $\theta_i$ for which $P(\theta_i|\Lambda_k)$ is largest and compare the corresponding $\ell(X|\theta_i)$ with the threshold $\gamma$ to make a decision.† If $P(\hat\theta|\Lambda_k)$ is greater than 1/2 it will be largest, and the suboptimum system will have the same average risk as the system based on knowledge of $\hat\theta$. If $P(\hat\theta|\Lambda_k) \le 1/2$, it may not be the largest, and an incorrect likelihood computer may be chosen.

† This suboptimum system is closely related to the "maximum-likelihood receiver" of Helstrom [Ref. 22, p. 238] and to the "type III" receiver of Wainstein and Zubakov [Ref. 23, p. 297]; however, it is different in that it takes the past into account.

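If the wrong $\theta_i$ is chosen, the risk may be bounded as follows. Define $P_{FA}^{(i)}(\hat\theta)$ and $P_M^{(i)}(\hat\theta)$ as

$$P_{FA}^{(i)}(\hat\theta) = \Pr\{\ell(X|\theta_i) > \gamma \mid H_2, \hat\theta\}, \qquad \theta_i \ne \hat\theta \tag{3.6a}$$

$$P_M^{(i)}(\hat\theta) = \Pr\{\ell(X|\theta_i) < \gamma \mid H_1, \hat\theta\}, \qquad \theta_i \ne \hat\theta \tag{3.6b}$$

Let $P_{FA} = \Pr\{\ell(X|\hat\theta) > \gamma \mid H_2, \hat\theta\}$ be the probability of false alarm when $\hat\theta$ is known, and let

$$\epsilon_{FA}(\hat\theta) = \max_{i=1,\ldots,m} P_{FA}^{(i)}(\hat\theta) - P_{FA} \tag{3.6c}$$

$$\epsilon_M(\hat\theta) = \max_{i=1,\ldots,m} P_M^{(i)}(\hat\theta) - 1 + P_{FA} \tag{3.6d}$$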


These two latter quantities are small in many interesting problems. For example, $\epsilon_{FA}(\hat\theta) = 0$ whenever the distribution of $\ell(X|\theta_i)$ conditioned on $H_2$ and $\hat\theta$ is independent of $\theta_i$, as is the case in the detection of a set of signals in additive normal noise when the signals are orthogonal after whitening. That is, let K be the covariance matrix of the noise; then the whitened signals are orthogonal if $S(\theta_i)^t K^{-1} S(\theta_j) = \delta_{ij} R$, where $\delta_{ij} = 1$ if $i = j$, and $\delta_{ij} = 0$ if $i \ne j$. In this particular case we may write

$$\ell(X|\theta_i) = \exp\left[-\tfrac{1}{2} R + S(\theta_i)^t K^{-1} X\right]$$

But since we may replace X by $S(\hat\theta) + N$ when $H_1$ is true and by $N$ when $H_2$ is true, the orthogonality of $S(\theta_i)$ and $S(\hat\theta)$ allows us to write, for both $H_1$ and $H_2$,

$$\ell(X|\theta_i) = \exp\left[-\tfrac{1}{2} R + S(\theta_i)^t K^{-1} N\right] \tag{3.7}$$

Thus the distribution of $\ell(X|\theta_i)$ will be the same whether $H_1$ or $H_2$ is true, and $\epsilon_M(\hat\theta)$ will be zero.

Next, we define two events, $A_k$ and $B_k$, as:

      $A_k$ = event $P(\hat\theta|\Lambda_k) > P(\theta_i|\Lambda_k)$ for all $\theta_i \ne \hat\theta$

      $B_k$ = event $P(\hat\theta|\Lambda_k) \le P(\theta_i|\Lambda_k)$ for some $\theta_i \ne \hat\theta$


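The risk of the suboptimum system, when $\hat\theta$ is true, is bounded by

$$\rho(d'(\Lambda_k)|\hat\theta) \le \rho(d^*(\hat\theta)|\hat\theta)\, P(A_k) + P(B_k) \max_i \left[ L p_2 P_{FA}^{(i)}(\hat\theta) + p_1 P_M^{(i)}(\hat\theta) \right] \tag{3.8}$$

Inserting the definitions of $\epsilon_M(\hat\theta)$, $\epsilon_{FA}(\hat\theta)$, and $\rho(d^*(\hat\theta)|\hat\theta)$ in this expression gives

$$\rho(d'(\Lambda_k)|\hat\theta) \le p_1 P_M P(A_k) + L p_2 P_{FA} [P(A_k) + P(B_k)] + P(B_k) \left( p_1 (1 + \epsilon_M) + L p_2 \epsilon_{FA} - p_1 P_{FA} \right) \tag{3.9}$$

where

$$P_M = \Pr\{\ell(X|\hat\theta) < \gamma \mid H_1, \hat\theta\}$$

is the miss probability when $\hat\theta$ is known, and the dependence of $\epsilon_M$ and $\epsilon_{FA}$ on $\hat\theta$ has been suppressed.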


For most problems we are interested in the region where $P_M$ and $P_{FA}$ are small compared to 1, and $P(A_k)$ is nearly one, so that by combining (3.5) and (3.9) we obtain

$$\rho(d^*(\Lambda_k)|\hat\theta) \le \rho(d^*(\hat\theta)|\hat\theta) + p_1 P(B_k) + p_1 \epsilon_M + L p_2 \epsilon_{FA} \tag{3.10}$$

Thus we may identify the risk error as

$$\epsilon_\rho = p_1 P(B_k) + p_1 \epsilon_M + L p_2 \epsilon_{FA} \tag{3.11}$$


Because the system performance is dependent on $\hat\theta$ through $P(B_k)$, $\epsilon_M(\hat\theta)$, etc., the application of this bound to any particular problem is difficult; and as the performance becomes more and more dependent on $\hat\theta$, the bound becomes less useful because it is more and more difficult to obtain an evaluation of $P(B_k)$. To obtain some insight into the nature of this bound, let us evaluate the bound for the problem of detection of a signal embedded in additive white gaussian noise when the signal waveform is unknown. The signal may take on one of m orthogonal waveforms. As noted previously, the normality of the noise and the orthogonality of the signals insure that when the signal is not present the choice of the proper $\theta_i$ does not affect false-alarm probability; that is,

$$P_{FA}^{(i)}(\hat\theta) = P_{FA} \qquad \text{for all } \theta_i$$

so that $\epsilon_{FA}(\hat\theta) = 0$. Similarly, the normality of the noise and orthogonality of the signals insure that

$$P_M^{(i)}(\hat\theta) = 1 - P_{FA}^{(i)}(\hat\theta) \qquad \text{for all } \theta_i \ne \hat\theta$$

where

      $R = E\{S(\theta_i)^t S(\theta_i)\} / \sigma_n^2$ is a constant signal-to-noise ratio
      $m$ = number of orthogonal signals
      $p_1$ = probability of signal occurrence
      $p_2 = 1 - p_1$ = probability no signal will occur
      $k$ = number of observations in the learning sequence

For large k this bound is asymptotic to

$$\frac{4(1 + p_1 p_2 R)}{p_1 k R}$$

Thus the system performance converges to the performance of the system which has a priori knowledge of the signal waveform at least as fast as inversely with $p_1 k$, the average number of learning samples which contain a signal in a sequence of length k.
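The inverse dependence on k may be seen numerically. The following sketch assumes that the asymptotic expression reads $4(1 + p_1 p_2 R)/(p_1 k R)$; the function name and the parameter values are hypothetical and serve only as an illustration.

```python
def asymptotic_risk_error_bound(k, p1, snr):
    """Large-k form of the bound, assumed to be 4(1 + p1*p2*R) / (p1*k*R)."""
    if k <= 0 or snr <= 0.0 or not 0.0 < p1 <= 1.0:
        raise ValueError("k and snr must be positive and p1 must lie in (0, 1]")
    p2 = 1.0 - p1
    return 4.0 * (1.0 + p1 * p2 * snr) / (p1 * k * snr)


if __name__ == "__main__":
    # Illustrative values: signal present half the time, unit signal-to-noise ratio.
    for k in (1000, 10000):
        print(k, asymptotic_risk_error_bound(k, p1=0.5, snr=1.0))
```

A tenfold increase in the length of the learning sequence reduces the bound by a factor of ten, in agreement with the statement above.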

C.  EXAMPLE 

The problem of detecting a signal of unknown frequency in white gaussian noise, which was used as example 2 of Chapter II, is an example of a problem in which the system performance may be evaluated by the preceding procedure. In this example, the set of possible signals is

$$S_i = \begin{cases} a \cos(\omega_i t + \phi), & 0 \le t \le T \\ 0, & \text{elsewhere} \end{cases} \tag{3.14}$$

where $\phi$ = a uniformly distributed random variable
      $a$ = a Rayleigh-distributed random variable with parameter A

In  this  case  all  of  the  preceding  conditions  are  met.  We  identify 


R  = 


2N  W 
o 


(3.15) 


where $N_0/2$ = noise spectral density
      $W$ = total band to be searched

and we can compute $\rho(d^*(\hat S)|\hat S)$ using standard procedures [Ref. 23, p. 173ff], and evaluate the risk error using Eq. (3.13). Parts (a) and (b) of Fig. 11 show the results for the case where there are 200 possible frequencies within the band W. Also shown are the following:

1. The performance of the optimum system given a priori knowledge of $\hat S$ for L = 1, and for $p_1 = 1/2$, as a function of the signal-to-noise ratio, R.

2. The performance curve of $\rho(d^*(\hat S)|\hat S)$ shifted by 3 db on the R axis.

3. Bounds for the performance of the learning receiver for 1,000 and 10,000 samples.

4. The performance of a near-optimum, 200-channel receiver (Wainstein and Zubakov "type III" receiver [Ref. 23, p. 300ff]) which does not learn.

It is clear that for case 3 above, the incremental risk introduced by lack of knowledge of the signal frequency is very small after 10,000 learning observations, and that it is not much different, after 1,000 observations, from the incremental risk introduced by doubling the noise power when the frequency is known. It is also clear that for almost any task the nonlearning receiver would be virtually useless at the signal-to-noise ratios shown.

FIG. 11. PERFORMANCE BOUNDS FOR A LEARNING SYSTEM.

D.  OTHER  TECHNIQUES  TO  OBTAIN  PERFORMANCE  BOUNDS 

A second technique to obtain bounds on system performance may be based on the use of Chernoff bounds for the tail probability of a sum of random variables. This technique is applicable to Bayes optimum systems which learn either with or without a teacher, since they may both be described by

$$\ell_k = \ell(X_k|\Lambda_{k-1}) = \int \ell(X_k|\theta)\, p(\theta|\Lambda_{k-1})\, d\theta \tag{3.16}$$

This sequence of likelihoods $\{\ell_i;\ i = 1, 2, \ldots\}$ is a martingale sequence as shown in Appendix C. It may be centered at its expectation by considering the "gain" at each new observation. Let $y_i = \ell_i - \ell_{i-1}$; then

$$\ell_k = \sum_{i=1}^{k} y_i$$

Shannon [Ref. 27] has applied Chernoff's bounding technique to such martingale sequences, and his work is almost directly applicable in this case.

Bounds on the tail probabilities may be written as inequalities involving bounds on the semi-invariant generating functions for the martingale sequence. For example, we may let $\nu_k(u|H_j)$ be the moment-generating function for $\ell_k$ conditioned on $H_j$ being true for $X_k$ ($H_1$ = hypothesis that signal is present; $H_2$ = hypothesis that signal is absent). That is,

$$\nu_k(u|H_j) = \int \cdots \int \exp\Big(u \sum_i y_i\Big)\, dP(y_1, y_2, \ldots, y_k \mid H_j) \tag{3.17}$$

Now define bounding functions by the relations

$$\gamma_i(u) \ge \int \exp(u y_i)\, dP(y_i \mid y_{i-1}, \ldots, y_1) \tag{3.18a}$$

$$\gamma_k(u|H_j) \ge \int \exp(u y_k)\, dP(y_k \mid y_{k-1}, \ldots, y_1, H_j) \tag{3.18b}$$

and

$$\mu_i(u) = \log \gamma_i(u) \tag{3.19a}$$

$$\mu_k(u|H_j) = \log \gamma_k(u|H_j) \tag{3.19b}$$

Suppose that we can find a single bound for all $i \le k - 1$, and call this $\mu_0(u)$. Then we can show that, for some real a, b > 0:

$$\Pr\{\ell_k \ge (k-1)\mu_0'(u) + \mu_k'(u|H_j)\} \le \exp\{(k-1)[\mu_0(u) - u\mu_0'(u)] + \mu_k(u|H_j) - u\mu_k'(u|H_j)\} \tag{3.20a}$$

for $0 \le u \le b$, and

$$\Pr\{\ell_k \le (k-1)\mu_0'(u) + \mu_k'(u|H_j)\} \le \exp\{(k-1)[\mu_0(u) - u\mu_0'(u)] + \mu_k(u|H_j) - u\mu_k'(u|H_j)\} \tag{3.20b}$$

for $-a \le u \le 0$.

One technique for finding a suitable $\mu_0(u)$ is to find two cumulative distribution functions $\Phi_1(y)$ and $\Phi_2(y)$ which bound $P(y_i \mid y_{i-1}, \ldots, y_1)$ above and below for all $y_i$. We then choose $\Phi_0(y)$ such that

      (1) $\Phi_0(y) = \Phi_1(y)$, $y \le \alpha$
      (2) $\Phi_0(y) = \Phi_2(y)$, $y \ge \beta$
      (3) $\Phi_0(y) = \Phi_1(\alpha) = \Phi_2(\beta)$, $\alpha \le y \le \beta$          (3.21)
      (4) $\int y\, d\Phi_0(y) = 0$

Define

$$\mu_0(u) = \log \int e^{uy}\, d\Phi_0(y) \tag{3.22}$$

Then $\mu_0(u)$ is a bound of the desired type. We can use this same approach to bound $\mu_k(u|H_j)$ by conditioning $\Phi_1(y|H_j)$ and $\Phi_2(y|H_j)$ on $H_j$. Unfortunately at this point the technique requires specification of the particular problem in more detail; i.e., the distributions of $\lim_{k \to \infty} \ell_k$ and $\ell_0$ must be specified. Although this is in general possible, for the case of detection of an unknown signal in gaussian noise both distributions are log-normal and the moment-generating functions do not exist for any u interval. This problem may be overcome by noting that any practical system to compute $\ell_k$ has a finite dynamic range, and by truncating the distribution at this limit. Such truncation makes evaluation of the bounds very difficult. Numerical solutions may of course be found for any particular problem by means of a computer; however, the results can only be expressed numerically and will most likely shed little light on the question of performance in general.
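The form of the Chernoff bound may be illustrated on a problem far simpler than the learning problem. The sketch below is an assumption-laden toy case, not the bound for the learning system: the gains are taken to be independent steps of +1 or -1 with equal probability, so that the semi-invariant generating function is log cosh u for every term and no separate final term is needed. The function names are hypothetical.

```python
import math


def chernoff_tail_bound(k, u):
    """Return (threshold, bound) such that Pr{S_k >= threshold} <= bound.

    S_k is the sum of k independent, zero-mean steps of +1 or -1, for which
    mu(u) = log cosh(u) and mu'(u) = tanh(u). Valid for u >= 0.
    """
    mu = math.log(math.cosh(u))
    mu_prime = math.tanh(u)
    return k * mu_prime, math.exp(k * (mu - u * mu_prime))


def exact_tail(k, threshold):
    """Exact Pr{S_k >= threshold} for the same +1/-1 sum, by direct counting."""
    favorable = 0
    for plus_steps in range(k + 1):
        if 2 * plus_steps - k >= threshold:
            favorable += math.comb(k, plus_steps)
    return favorable / 2 ** k


if __name__ == "__main__":
    for u in (0.1, 0.5, 1.0):
        threshold, bound = chernoff_tail_bound(50, u)
        print(u, threshold, exact_tail(50, threshold), bound)
```

For each value of u the exact tail probability lies below the exponential bound, which is the behavior that (3.20a) expresses for the martingale case.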

There are two other methods for determining system performance which should be considered by anyone setting out to decide whether or not a learning system is worth the cost in time, complexity, and money for any particular problem. These methods involve either a direct evaluation of the cumulative probability distribution of $\ell(X_k|\Lambda_{k-1})$ from the known statistics of X and $\Lambda$, or a determination of the statistics of $\ell(X_k|\Lambda_{k-1})$ by simulation of the system and design of an experiment to determine the desired performance measures. Both of these approaches seem to be complex for reasonably large k; however, in certain problems, particularly where convergence is rapid, either approach may be useful.
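The simulation approach may be sketched for the orthogonal-signal problem of Section B. This is a minimal sketch under stated assumptions, not the experiment design referred to above: the matched-filter outputs $S(\theta_i)^t K^{-1} X$ are drawn directly as independent normal variables of variance R (with mean R on the true waveform when a signal is present), equal costs are assumed, and the learning rule multiplies each a posteriori probability by $(\ell(X|\theta_i) + \alpha)/(\ell(X|\Lambda) + \alpha)$ with $\alpha = p_2/p_1$. All names and default values are hypothetical.

```python
import math
import random


def simulate_learning_detector(m=4, snr=16.0, p1=0.5, k=200, true_index=0, seed=1):
    """Simulate k observation-decision cycles of learning without a teacher.

    Returns the final a posteriori probabilities over the m waveforms and the
    fraction of the k decisions that were in error.
    """
    rng = random.Random(seed)
    alpha = (1.0 - p1) / p1        # ratio of a priori probabilities, absent/present
    gamma = alpha                  # threshold L*p2/p1 with L = 1 (assumed)
    posterior = [1.0 / m] * m      # uniform a priori distribution (assumed)
    sigma = math.sqrt(snr)
    errors = 0
    for _ in range(k):
        signal_present = rng.random() < p1
        # Matched-filter outputs: zero mean, variance R, independent by orthogonality.
        z = [rng.gauss(0.0, sigma) for _ in range(m)]
        if signal_present:
            z[true_index] += snr   # mean R on the true waveform only
        lr = [math.exp(zi - 0.5 * snr) for zi in z]
        mixture = sum(p * l for p, l in zip(posterior, lr))
        decided_present = mixture > gamma
        errors += decided_present != signal_present
        # Learning step: the observation is used without knowledge of its class.
        posterior = [p * (l + alpha) / (mixture + alpha) for p, l in zip(posterior, lr)]
    return posterior, errors / k


if __name__ == "__main__":
    final_posterior, error_rate = simulate_learning_detector()
    print(final_posterior, error_rate)
```

Repeating the run over many seeds and several values of k would give an empirical estimate of the transient average risk.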
IV. LEARNING TIME-VARYING PARAMETERS


In  the  previous  chapters  a  technique  has  been  developed  which  will 
allow  the  synthesis  of  systems  which  learn  without  a  teacher  when  the 
unknown  parameter  is  fixed.  This  technique  was  accomplished  by  treating 
the  unknown  parameter  as  a  random  variable.  In  this  chapter  the  same 
problem  will  be  examined  for  the  case  where  the  unknown  parameter  is  not 
fixed,  but  varies  with  time.  A  synthesis  technique  for  systems  to  solve 
this  problem  will  be  developed  by  taking  an  approach  similar  to  the 
previous  one  and  treating  the  unknown  parameter  as  a  random  variable 
which  is  time  varying. 

A.  MODELS  FOR  THE  TIME-VARYING  PARAMETER  PROBLEM 

As in Chapter II the problem considered will be the binary decision problem phrased in terms of detection of a signal which depends upon a set of unknown parameters. The results may be generalized to obtain the extension to multiple-hypothesis testing as in Chapter II.

The data to be used consist of an observation $X_k$ and a learning sequence $\Lambda_{k-1} = (X_1, X_2, \ldots, X_{k-1})$. Each observation contains a signal corrupted by noise, or it contains noise alone, and it is desired to synthesize a system to decide whether or not the k-th observation ($X_k$) contains a signal. The Bayes-optimum system making optimal use of the learning sequence is required. This problem differs from the problem of Chapter II in this sense: the values of the unknown signal parameters are not the same from observation to observation. This fact is indicated by indexing the parameter set with a lowercase letter "c"; i.e., the signal parameters defining the signal present (if any) in the current observation ($X_k$) are designated $\theta_c$.

Formally, we let

      $H_1$ = hypothesis that $X_k = S(\theta_c) \oplus N$
      $H_0$ = hypothesis that $X_k = N$

where $S(\theta_c)$ is the current signal vector (unknown parameters) and N is the noise vector.

For  a  particular  problem  the  statistical  nature  of  the  noise  and  the 
corrupting  operation  are  assumed  to  be  known  so  that  the  only  unknowns 
are  the  signal  parameters.  In  order  to  solve  the  problem  a  statistical 
model  of  the  signal-parameter  variations  from  observation  to  observation 
is  required.  The  statistical  model  must  include  a  description  of  the 
way  in  which  the  current  values  of  the  parameters  depend  upon  past  values, 
and a description of the statistics of the times of occurrence of changes.
The  former  description  will  be  called  "value  dependence"  and  the  latter 
"time  dependence. " 

The value dependence of the signal parameters may be described by the probability density of the c-th realization of the signal conditioned on all of the past realizations:

$$p(\theta_c \mid \theta_{c-1}, \theta_{c-2}, \ldots, \theta_1)$$

In some problems, particularly the frequency-hopping signal reconnaissance problem to be described in example 2, the c-th realization will be independent of the past, so that

$$p(\theta_c \mid \theta_{c-1}, \theta_{c-2}, \ldots, \theta_1) = p_0(\theta_c) \tag{4.1}$$

In other problems the dependence may be Markov so that

$$p(\theta_c \mid \theta_{c-1}, \theta_{c-2}, \ldots, \theta_1) = p(\theta_c \mid \theta_{c-1}, \ldots, \theta_{c-M}) \tag{4.2}$$

In yet other problems the entire past may enter; however, these problems lead to systems which grow in size with k. For this reason the value dependence will be restricted to be at worst M-th order Markov.


Throughout this chapter it is assumed that changes in parameter value can take place only at a (countable) set of discrete instants in time.

The time dependence cannot be described as generally as the value dependence; however, there are two types of time dependence which are of particular interest because they occur frequently in physical problems.

The first type will be designated the "general random walk," since we assume that a change takes place at the start of each observation. The amount of change, as well as the direction, depends on the past history and is described by $p(\theta_c \mid \theta_{c-1}, \theta_{c-2}, \ldots, \theta_1)$. An example of an unknown time-varying parameter which may be approximated by this model is the complex gain of a communication channel which is slowly varying with respect to the duration of one signal (see example 1 of this chapter).

The second type of time dependence will be designated a "binomial" dependence. In this model the changes in the parameter occur at moments which coincide with the start of an observation, but changes do not occur at each new observation. The probability that n changes will occur in j trials is the binomial distribution,

$$P_n(j) = \binom{j}{n} p^n (1 - p)^{j-n} \tag{4.3}$$

where p is the probability of a change in one trial. Once again the value dependence is described by the conditional density $p(\theta_c \mid \theta_{c-1}, \theta_{c-2}, \ldots, \theta_1)$. An example of a parameter which has a "binomial" time dependence is the frequency of a frequency-hopping signal as explained in example 2 of this chapter.
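The binomial time dependence may be made concrete with a short computation. The sketch below evaluates the probability of n changes in j observations; the function name and the change probability are hypothetical values chosen for illustration.

```python
import math


def changes_pmf(n, j, p):
    """Probability of exactly n parameter changes in j observations (binomial)."""
    if not 0 <= n <= j:
        return 0.0
    return math.comb(j, n) * p ** n * (1.0 - p) ** (j - n)


if __name__ == "__main__":
    # Illustrative case: a change occurs at the start of one observation in ten.
    for n in range(4):
        print(n, changes_pmf(n, j=3, p=0.1))
```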

B.  SOLUTION  TO  THE  PROBLEM 

In order to obtain a solution to the learning problem, $\theta_c$ is treated as a random variable and the a posteriori distribution of $\theta_c$ is learned. As before, the Bayes system will compute the conditional likelihood ratio

$$\ell(X_k|\Lambda_{k-1}) = \int \ell(X_k|\theta_c)\, p(\theta_c|\Lambda_{k-1})\, d\theta_c \tag{4.4}$$

and compare it to the appropriate threshold. Since it is assumed that $\ell(X_k|\theta_c)$ has known form, the problem again reduces to the computation of $p(\theta_c|\Lambda_{k-1})$. In order to compute this function, a time-dependence model must be given; hence the problem may be treated for two cases:

1. Case 1, General-Random-Walk Time Dependence

a. General Solution

In this case the index c will coincide with k since $\theta$ will change with each new observation; therefore,


k-1 


P^Gki\-l^  = 


p(0il8i-i . eiy|  n  P^iKy  de^  ...  dex 


(4.5) 


Thus if the value dependence is not at least as simple as M-th order Markov [i.e., if $p(\theta_k \mid \theta_{k-1}, \ldots, \theta_1)$ may not be written as $p(\theta_k \mid \theta_{k-1}, \ldots, \theta_{k-M})$], then the system to compute $p(\theta_k|\Lambda_{k-1})$ must grow in size linearly with k. The complexity of the system would grow much more rapidly. This is the reason for restriction of the statistics describing the value dependence of the parameter to be M-th order Markov with M finite.


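b. First-Order Markov Value Dependence--Vector Parameters

For simplicity in obtaining a system from this equation, assume that the value dependence of the parameter is first-order Markov so that

$$p(\theta_k|\Lambda_{k-1}) = \int p(\theta_k, \theta_{k-1}|\Lambda_{k-1})\, d\theta_{k-1} = \int p(\theta_k|\theta_{k-1})\, \frac{p(X_{k-1}|\theta_{k-1})}{p(X_{k-1}|\Lambda_{k-2})}\, p(\theta_{k-1}|\Lambda_{k-2})\, d\theta_{k-1} \tag{4.6}$$

Equation (4.6) illustrates once again the recursive nature of the computations; that is, once $p(\theta_{k-1}|\Lambda_{k-2})$, $p(X_{k-1}|\theta_{k-1})$, and $p(X_{k-1}|\Lambda_{k-2})$ are computed (all of these quantities will be computed during the previous observation-decision cycle), $p(\theta_k|\Lambda_{k-1})$ can be computed.

Equation (4.6) may be rewritten in terms of likelihood ratios as follows:

$$p(\theta_k|\Lambda_{k-1}) = \int p(\theta_k|\theta_{k-1}) \left[ \frac{\ell(X_{k-1}|\theta_{k-1}) + \alpha}{\ell(X_{k-1}|\Lambda_{k-2}) + \alpha} \right] p(\theta_{k-1}|\Lambda_{k-2})\, d\theta_{k-1} \tag{4.7}$$

where as before $\alpha = p(H_0)/p(H_1)$. After rewriting Eq. (4.4) as shown below, the required system may be conveniently synthesized.

$$\ell(X_k|\Lambda_{k-1}) = \iint \ell(X_k|\theta_k)\, p(\theta_k|\theta_{k-1}) \left[ \frac{\ell(X_{k-1}|\theta_{k-1}) + \alpha}{\ell(X_{k-1}|\Lambda_{k-2}) + \alpha} \right] p(\theta_{k-1}|\Lambda_{k-2})\, d\theta_{k-1}\, d\theta_k \tag{4.8}$$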


From (4.7) and (4.8) the system shown in Fig. 12a may be synthesized. This system operates in a manner similar to that described in Chapter II. It performs three operations:

1. Compute $\ell(X_k|\theta_k)$ for each possible $\theta_k$.

2. Compute $p(\theta_k|\Lambda_{k-1})$ for each possible $\theta_k$.

3. Weight (1) by (2), and sum over all $\theta_k$.
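When $\theta$ takes only a finite number of values, the three operations above reduce to sums. The following sketch is a minimal discrete illustration of Eqs. (4.7) and (4.8), assuming the likelihood ratios are supplied from elsewhere; the function names, the transition matrix, and the numbers are hypothetical.

```python
def posterior_after_observation(prior, lr, alpha):
    """p(theta_{k-1}|Lambda_{k-1}) from p(theta_{k-1}|Lambda_{k-2}) and l(X_{k-1}|theta_{k-1})."""
    mixture = sum(p * l for p, l in zip(prior, lr))       # l(X_{k-1}|Lambda_{k-2})
    return [p * (l + alpha) / (mixture + alpha) for p, l in zip(prior, lr)]


def predict_next_parameter(prior, lr, transition, alpha):
    """Discrete form of Eq. (4.7).

    transition[n][i] is p(theta_k = value n | theta_{k-1} = value i), so each
    column of the matrix sums to one.
    """
    updated = posterior_after_observation(prior, lr, alpha)
    return [sum(row[i] * updated[i] for i in range(len(updated))) for row in transition]


def conditional_likelihood_ratio(predicted, current_lr):
    """Discrete form of Eq. (4.8): weight l(X_k|theta_k) by p(theta_k|Lambda_{k-1})."""
    return sum(p * l for p, l in zip(predicted, current_lr))


if __name__ == "__main__":
    transition = [[0.9, 0.1],
                  [0.1, 0.9]]
    predicted = predict_next_parameter([0.5, 0.5], [3.0, 1.0], transition, alpha=1.0)
    print(predicted, conditional_likelihood_ratio(predicted, [2.0, 0.5]))
```

The time variation enters only through the multiplication by the transition matrix, which corresponds to the added components in the probability loop.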

It is in the performance of the second operation that the time variation of $\theta$ is taken into account by including the three components [the $p(\theta_k|\theta_{k-1})$ generator, the multiplier, and the integrator] in the probability loop. The $\ell(X_k|\theta_k)$ computer must "sweep" through all values of $\theta_k$, and the $p(\theta_k|\theta_{k-1})$ generator must "sweep" through all combinations of values of $\theta_k$ and $\theta_{k-1}$. If there are only a finite number of values of $\theta$, the system may be realized in a parallel form by utilizing a set of parallel computers indexed by i = 1, 2, ..., where i indexes the possible values of $\theta$. In this case the two integrators are replaced by summers with inputs from the parallel circuitry. A block diagram of such a general parallel system is difficult to draw; however, the parallel form is used in the solution of example 2 later in this chapter.

FIG. 12. LEARNING SYSTEM FOR GENERAL-RANDOM-WALK TIME DEPENDENCE. a. Case 1b. b. Case 1c. (T = time of one observation; $\tau$ = time to sweep through all $\theta$; assume all $\theta_k$, $\theta_{k-1}$ swept in $2\tau < T$.)

c.  First-Order  Markov  Value  Dependence--Scalar  Parameters 

If the unknown parameter is scalar and has a random-walk type of dependence on the past, the system of Fig. 12a simplifies somewhat. In this case, we may represent $\theta_k$ as a perturbation of $\theta_{k-1}$; i.e., let

$$\theta_k = \theta_{k-1} + \Delta_k \tag{4.9}$$

where $\Delta_k$ is independent of $\theta_{k-1}$. Let the distribution of $\Delta_k$ be $p_\Delta(\cdot)$. Then a simple transformation will provide Eq. (4.10).

$$p(\theta_k|\theta_{k-1}) = p_\Delta(\theta_k - \theta_{k-1}) \tag{4.10}$$

Equations (4.7) and (4.8) may be rewritten as

$$p(\theta_k|\Lambda_{k-1}) = \int p_\Delta(\theta_k - \theta_{k-1}) \left[ \frac{\ell(X_{k-1}|\theta_{k-1}) + \alpha}{\ell(X_{k-1}|\Lambda_{k-2}) + \alpha} \right] p(\theta_{k-1}|\Lambda_{k-2})\, d\theta_{k-1} \tag{4.11}$$

$$\ell(X_k|\Lambda_{k-1}) = \iint \ell(X_k|\theta_k)\, p_\Delta(\theta_k - \theta_{k-1}) \left[ \frac{\ell(X_{k-1}|\theta_{k-1}) + \alpha}{\ell(X_{k-1}|\Lambda_{k-2}) + \alpha} \right] p(\theta_{k-1}|\Lambda_{k-2})\, d\theta_{k-1}\, d\theta_k \tag{4.12}$$

When $\theta$ is a scalar, the system can be realized by sweeping through the range of $\theta$ in some interval $\tau$ (which must be less than half the observation interval, T/2, for real-time operation). In this case $\theta_k$ and $\theta_{k-1}$ are two different time variables, and Eq. (4.11) represents a convolution. For this reason, the system may be realized as shown in Fig. 12b, where the only difference from the system with fixed parameter is the filter with impulse response $p_\Delta(t)$ in the probability loop. In order to insure that this filter may be realized, the delay of $\tau$ in the forward loop has been added. This concept of the filter $p_\Delta(t)$ will be useful in the solution of the second example.
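The convolution in Eq. (4.11) may be sketched on a grid of equally spaced parameter values. In this minimal sketch the perturbation distribution is given as a dictionary from grid offsets to probabilities; probability that would leave the grid is discarded and the result renormalized, which is an assumption of the sketch and not a feature of the original system. All names and values are hypothetical.

```python
def filter_posterior(updated, perturbation_pmf):
    """Convolve p(theta_{k-1}|Lambda_{k-1}) with p_Delta on an integer grid.

    updated[i]          probability that theta_{k-1} is the i-th grid value
    perturbation_pmf[d] probability that Delta_k equals d grid steps
    """
    n = len(updated)
    predicted = [0.0] * n
    for new in range(n):
        for old in range(n):
            predicted[new] += perturbation_pmf.get(new - old, 0.0) * updated[old]
    total = sum(predicted)
    return [value / total for value in predicted]   # renormalize after truncation


if __name__ == "__main__":
    # Hypothetical symmetric perturbation of at most one grid step.
    p_delta = {-1: 0.25, 0: 0.5, 1: 0.25}
    print(filter_posterior([0.0, 1.0, 0.0], p_delta))
```

A parameter known exactly before the change is spread into the shape of $p_\Delta$, which is the impulse-response interpretation given above.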
If the parameter value dependence is first-order Markov, but not representable as an independent perturbation, then $p(\theta_k|\theta_{k-1})$ will not lead to a time-invariant filter as above. Instead it will lead to a time-varying filter as a replacement for the filter with impulse response $p_\Delta(t)$ in Fig. 12b. The replacement will have a time-varying impulse response

$$h(t, \gamma) = p(\theta_k = t \mid \theta_{k-1} = \gamma) \tag{4.13}$$

The output of this filter at time t, for an input z(t), is defined by

$$e_0(t) = \int_{-\infty}^{t} h(t, \gamma)\, z(\gamma)\, d\gamma \tag{4.14}$$

Methods for the realization of such filters are beyond the scope of this study; however, one method for a particular form of $h(t, \gamma)$ is suggested in example 1. For other methods see Refs. 25, 26.
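d. M-th Order Markov Value Dependence--Vector Parameters

If the unknown parameter is M-th order Markov, the same general approach to system synthesis may be taken. Equation (4.5) becomes

$$p(\theta_k|\Lambda_{k-1}) = \int \cdots \int p(\theta_k \mid \theta_{k-1}, \ldots, \theta_{k-M})\, p(\theta_{k-1}, \ldots, \theta_{k-M} \mid \Lambda_{k-1})\, d\theta_{k-1} \cdots d\theta_{k-M} \tag{4.15}$$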


The change which this requires in the block diagram is simple enough; however, the complexity of the system, even for scalar $\theta$, rapidly becomes intolerable. To see that this is true, assume that $\theta$ is scalar and M = 2. Then somewhere within the system, the function $p(\theta_k \mid \theta_{k-1}, \theta_{k-2})$ must be stored for all possible combinations of $\theta_k$, $\theta_{k-1}$, and $\theta_{k-2}$. If there are, say, N possible values which $\theta$ may take on, the storage required is $N^3$. In general, the storage increases as $N^{M+1}$.
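The growth in storage may be tabulated directly. The sketch below simply counts the entries of the conditional density table, assuming one entry for every combination of the current value and the M previous values; the function name and the value N = 100 are hypothetical.

```python
def transition_table_entries(n_values, markov_order):
    """Entries needed to store p(theta_k | theta_{k-1}, ..., theta_{k-M})."""
    if n_values < 1 or markov_order < 0:
        raise ValueError("n_values must be positive and markov_order non-negative")
    return n_values ** (markov_order + 1)


if __name__ == "__main__":
    for order in (1, 2, 3):
        print(order, transition_table_entries(100, order))
```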

2. Case 2, Binomial Time Dependence
a.  General 

In  the  case  of  binomial  time  dependence  the  integral-valued 
variable  j  is  defined  as  follows: 

j  =  number  of  observations  since  the  last  change  in  0 

Then  since  the  changes  occur  at  moments  which  are  binomially  distributed, 
j  will  be  exponentially  distributed  as  follows: 

$$P(j) = p (1 - p)^{j-1} \tag{4.16}$$

where  p  is  the  probability  of  a  change  in  0. 
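The distribution of j may be checked numerically. The following sketch evaluates Eq. (4.16) and confirms that successive terms differ by the factor 1 - p; the function name and the value of p are hypothetical.

```python
def run_length_pmf(j, p):
    """Eq. (4.16): probability that the last change occurred j observations ago."""
    if j < 1:
        return 0.0
    return p * (1.0 - p) ** (j - 1)


if __name__ == "__main__":
    p = 0.1  # illustrative probability of a change at each observation
    print([round(run_length_pmf(j, p), 4) for j in range(1, 6)])
    print(run_length_pmf(2, p) / run_length_pmf(1, p))
```

The constant ratio between successive terms is what allows the sum over j to be computed recursively.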

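In order to obtain a recursive relation from which a system may be synthesized, the distribution of $\theta$ is conditioned on j as well as on the last value of $\theta$; i.e.,

$$p(\theta_c|\Lambda_{k-1}) = \sum_j P(j)\, p(\theta_c \mid j, \Lambda_{k-1}) \tag{4.17}$$

But

$$p(\theta_c \mid j, \Lambda_{k-1}) = \frac{p(\Lambda_{k-1} \mid \theta_c, j)\, p(\theta_c \mid j)}{p(\Lambda_{k-1} \mid j)} \tag{4.18a}$$

or

$$p(\theta_c \mid j, \Lambda_{k-1}) = \prod_{i=k-j}^{k-1} \left[ \frac{p(X_i|\theta_c)}{p(X_i|\Lambda_{i-1})} \right] p(\theta_c \mid \Lambda_{k-j-1}, j) = \prod_{i=k-j}^{k-1} \left[ \frac{\ell(X_i|\theta_c) + \alpha}{\ell(X_i|\Lambda_{i-1}) + \alpha} \right] p(\theta_c \mid \Lambda_{k-j-1}, j) \tag{4.18b}$$

Also,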

$$p(\theta_c \mid \Lambda_{k-j-1}, j) = \int p(\theta_c \mid \theta_{c-1}, \Lambda_{k-j-1}, j)\, p(\theta_{c-1} \mid \Lambda_{k-j-1}, j)\, d\theta_{c-1} \tag{4.19}$$

Hence (4.17) may be rewritten as

$$p(\theta_c|\Lambda_{k-1}) = \sum_j P(j) \prod_{i=k-j}^{k-1} \left[ \frac{\ell(X_i|\theta_c) + \alpha}{\ell(X_i|\Lambda_{i-1}) + \alpha} \right] \int p(\theta_c \mid \theta_{c-1}, \Lambda_{k-j-1}, j)\, p(\theta_{c-1} \mid \Lambda_{k-j-1}, j)\, d\theta_{c-1} \tag{4.20}$$

Note that $p(\theta_{c-1} \mid \Lambda_{k-j-1}, j)$ is the value of $p(\theta_c|\Lambda_{k-1})$ calculated j observations ago, so that Eq. (4.20) is in some sense recursive. This fact will be exploited in the following paragraphs.

b. First-Order Markov Value Dependence--Vector Parameters

In order to interpret Eq. (4.20) as a system block diagram, the problem is simplified by assuming that the value dependence of $\theta$ is first-order Markov. In this case

$$p(\theta_c \mid \Lambda_{k-j-1}, j) = \int p(\theta_c|\theta_{c-1})\, p(\theta_{c-1} \mid \Lambda_{k-j-1}, j)\, d\theta_{c-1} \tag{4.21}$$

Call this $P_{k-j-1}$. In order to rewrite Eq. (4.20), denote

$$L_i = \frac{\ell(X_i|\theta_c) + \alpha}{\ell(X_i|\Lambda_{i-1}) + \alpha} \tag{4.22}$$

Then Eq. (4.20) may be expanded and written as

$$p(\theta_c|\Lambda_{k-1}) = P(1)\, L_{k-1} P_{k-2} + P(2)\, L_{k-1} L_{k-2} P_{k-3} + \cdots \tag{4.23}$$

This function is recursive in the sense that once $L_{k-1}$ and $P_{k-2}$ are available, $L_k$ and $P_k$ may be computed from $X_k$ and $X_{k-1}$. A system to realize this computation in delay-feedback form is shown in Fig. 13.

c. Independent Values--Vector Parameters

When the value of $\theta_c$ is independent of the past values of $\theta$ (which will occur, for example, in the frequency-hopping problem), Eq. (4.23) simplifies and the resulting system is more manageable. In this case,

$$p(\theta_c|\theta_{c-1}) = p_0(\theta_c) \tag{4.24}$$

hence

$$p(\theta_c|\Lambda_{k-1}) = p_0(\theta_c)\, P(1)\, L_{k-1} \left[ 1 + q L_{k-2} \left( 1 + q L_{k-3} ( 1 + \cdots ) \right) \right] \tag{4.25}$$

where q = 1 - p = P(2)/P(1). A computer for this equation may be realized as in Fig. 14.

FIG. 13. PROBABILITY COMPUTER, CASE 2b. (p = probability of a change in $\theta$; q = 1 - p = P(2)/P(1).)

FIG. 14. PROBABILITY COMPUTER, CASE 2c.



SEL-65-011 


d. Independent Values--Scalar Parameters

When the unknown parameter is scalar (such as in the frequency-hopping problem), the system to compute $\ell(X_k \mid \Lambda_{k-1})$ may be realized in the sweeping form shown in Fig. 15.

FIG. 15. LEARNING SYSTEM, CASE 2c.
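
The recursion for the independent-value case can be illustrated numerically. The following Python sketch is not taken from the report. It assumes that the parameter is quantized to a finite grid, that the change probability p is known, that $\alpha$ stands for the ratio P(2)/P(1) of the class probabilities, and that each observation multiplies the stored density by a factor of the form $(\ell(X_k \mid \theta) + \alpha)/(\ell(X_k \mid \Lambda_{k-1}) + \alpha)$, as in Eq. (4.22). The function names and the toy likelihood-ratio values are hypothetical.

```python
def predict(posterior, prior, p_change):
    """Density used for the next observation: with probability p_change the
    parameter jumps to a fresh value drawn from `prior`; otherwise it stays."""
    q = 1.0 - p_change
    return [q * post + p_change * pri for post, pri in zip(posterior, prior)]


def update(predicted, lik_ratio, alpha):
    """One learning step on a quantized parameter grid.

    predicted : p(theta | past observations), one entry per grid point
    lik_ratio : conditional likelihood ratio l(X_k | theta) per grid point
    alpha     : assumed ratio P(2)/P(1) of the class probabilities
    Returns (averaged likelihood ratio, updated density).
    """
    # Averaged ratio l(X_k | past) = sum over theta of l(X_k | theta) p(theta | past)
    avg = sum(l * w for l, w in zip(lik_ratio, predicted))
    # Scale each grid point by (l(X_k | theta) + alpha) / (l(X_k | past) + alpha)
    post = [w * (l + alpha) / (avg + alpha) for l, w in zip(lik_ratio, predicted)]
    return avg, post


prior = [0.25, 0.25, 0.25, 0.25]            # p_0(theta) on a 4-point grid
posterior = list(prior)
observed_ratios = [[0.5, 4.0, 0.5, 0.5],    # toy values of l(X_k | theta)
                   [0.8, 3.0, 0.6, 0.7]]
for ratios in observed_ratios:
    predicted = predict(posterior, prior, p_change=0.05)
    avg_ratio, posterior = update(predicted, ratios, alpha=1.0)
```

In this sketch the averaged ratio `avg_ratio` plays the role of the system output that is compared with a threshold, while `posterior` is the quantity held in the feedback loop between observations.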


C.  EXAMPLES 

In order to demonstrate the utility of this synthesis technique when applied to problems in which the unknown parameter is time varying, consider the solution to the problems described briefly below.

1. The Fading-Channel Problem

In order to obtain a system which will be simple enough to illustrate the application of the foregoing technique, and at the same time realistic enough to demonstrate the utility of this technique, we shall utilize the following mathematical model of a data link using on-off keying for binary-coded transmission of data through a fading channel.* Figure 16 illustrates the channel.

* The channel model used is, according to Turin [Ref. 28], representative of propagation through the ionosphere above the MUF, or through the troposphere.
FIG. 16. THE FADING-CHANNEL MODEL. (Block diagram: the signal input s(t) passes through the fading channel to give y(t); the additive noise n(t) is then added to give the received signal x(t).)

a. On-Off Keyed (OOK) Signals

(1) Signal. The information is transmitted as a sequence of marks and spaces. The signal is on for a duration T when a mark is being transmitted, and off for a duration T when a space is being transmitted. When the signal is on, it has the form

$$s(t) = \mathrm{Re}\{ s(t) \exp(j\omega_0 t) \}$$

where $\omega_0$ is known; s(t) on the right-hand side is a known, real, lowpass modulation waveform of duration T; and Re denotes "real part of."

(2) Channel. The "nonselective, slow-fading" channel model used by Turin [Ref. 28]* will be assumed. This channel is represented best by its operation on the signal. The channel output y(t) may be represented as

$$y(t) = \mathrm{Re}\{ g\, s(t - \tau) \exp[j(\omega_0 t - \phi)] \}$$

(Thus by ignoring the modulation delay $\tau$, we may think of the channel as a multiplicative medium with constant $G = g e^{-j\phi}$.) The medium is characterized by three quantities: g, the attenuation; $\tau$, the modulation delay; and $\phi$, the carrier phase shift. We assume that $\tau$ is known to the receiver, that g is Rayleigh distributed, and that $\phi$ is uniformly distributed over the interval 0 to $2\pi$. The channel is assumed to vary slowly, so that g and $\phi$ may be treated as constants over at least one signal duration T. More detailed time-variation assumptions will be made later as they are required.

* Turin discusses this model and its physical justification in considerable detail, and therefore no attempt is made here to repeat his discussion.

The additive noise is assumed to be gaussian with constant spectral density $N_0/2$ over the narrow band of interest. Since n(t) is a narrowband gaussian random process (NBGRP),* it may be written in terms of a complex modulation process as

$$n(t) = \mathrm{Re}\{ \eta(t) \exp(j\omega_0 t) \}$$

where $\eta(t)$ is a lowpass, complex, gaussian random process (GRP).

(3) Problem Formulation. The problem is to process the received waveform in a manner which will result in a minimum average risk decision. Because g is Rayleigh and $\phi$ is uniform, the quantity $g s(t) e^{-j\phi}$ is a lowpass complex GRP, and the quantity $g s(t) e^{-j\phi} + \eta(t)$ must be a lowpass GRP. We may note that x(t) under either hypothesis may be written as the cissoid $\exp(j\omega_0 t)$ modulated by a complex, lowpass GRP; hence x(t) is an NBGRP.

If we utilize the complex notation

$$x(t) = \mathrm{Re}\{ \xi(t) \exp(j\omega_0 t) \}$$

we may reformulate the hypotheses in terms of $\xi(t)$ as follows:

$$H_1:\ \xi(t) = g s(t) e^{-j\phi} + \eta(t)$$

$$H_2:\ \xi(t) = \eta(t)$$

* A comprehensive discussion of the properties of narrowband gaussian random processes may be found in Refs. 25 and 26.
(We note, parenthetically, that we may obtain $\xi(t)$ to a very good approximation from x(t) utilizing the following equations:

$$\mathrm{Re}\{\xi(t)\} = \frac{2}{a} \int_{t-a}^{t} x(u) \cos \omega_0 u \, du \quad [= \check{x}(t)]$$

$$\mathrm{Im}\{\xi(t)\} = -\frac{2}{a} \int_{t-a}^{t} x(u) \sin \omega_0 u \, du \quad [= \tilde{x}(t)]$$

where a is short compared to time variations in s(t) and long compared to variations in $\cos \omega_0 t$. We denote these two real quantities by $\check{x}(t)$ and $\tilde{x}(t)$ respectively.)
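
The extraction of the in-phase and quadrature components can be checked numerically. The Python sketch below is an illustration only. It assumes the convention $x(t) = \mathrm{Re}\{\xi(t)\exp(j\omega_0 t)\}$, a normalizing factor of 2/a, an averaging window a equal to a whole number of carrier periods, and an envelope that is constant over the window. The carrier frequency, envelope value, and function names are hypothetical.

```python
import cmath
import math


def complex_envelope(x, omega0, t_end, window, n_steps=20000):
    """Estimate the complex envelope of x(t) at time t_end by averaging
    x(t)cos(omega0 t) and x(t)sin(omega0 t) over the last `window` seconds."""
    dt = window / n_steps
    in_phase = 0.0
    quadrature = 0.0
    for i in range(n_steps):
        t = t_end - window + (i + 0.5) * dt      # midpoint rule
        value = x(t)
        in_phase += value * math.cos(omega0 * t) * dt
        quadrature += value * math.sin(omega0 * t) * dt
    # With x(t) = Re{xi * exp(j omega0 t)} the sine average carries a minus sign.
    return complex(2.0 * in_phase / window, -2.0 * quadrature / window)


omega0 = 2.0 * math.pi * 1000.0        # hypothetical 1 kHz carrier
xi_true = complex(0.7, -0.4)           # hypothetical constant envelope


def received(t):
    return (xi_true * cmath.exp(1j * omega0 * t)).real


# Window of 10 carrier periods: long compared with the carrier period.
xi_est = complex_envelope(received, omega0, t_end=0.05, window=0.01)
```

With a slowly varying envelope the same averages return the envelope smoothed over the window, which is the sense in which the approximation is "very good" when a is short compared with variations in s(t).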

It is shown in Ref. 26 that the real and imaginary parts of the lowpass complex envelope of an NBGRP are independent if they have symmetric spectral distribution; hence $\check{x}(t)$ and $\tilde{x}(t)$ are independent GRP's if we assume that the spectrum of the fading medium meets these requirements.

For brevity we denote by (ˇ) the real part and by (~) the imaginary part. We note that $\tilde{s}(t)$ may be considered to be zero for on-off keyed signals, and denote

$$\check{g} = \mathrm{Re}\{ g e^{-j\phi} \}$$

$$\tilde{g} = \mathrm{Im}\{ g e^{-j\phi} \}$$

We may identify $\check{g}$ and $\tilde{g}$ as the in-phase and the quadrature channel gains. Then when $H_1$ is true,

$$\check{x}(t) = \check{g}\, s(t) + \check{\eta}(t)$$

$$\tilde{x}(t) = \tilde{g}\, s(t) + \tilde{\eta}(t)$$

We represent the kth observation of $\check{x}(t)$ and $\tilde{x}(t)$ as the column vectors $\check{X}_k$ and $\tilde{X}_k$, which have as their rows the 2TW samples of $\check{x}(t)$ and $\tilde{x}(t)$, $(k-1)T \le t < kT$, sampled at the rate 2W samples per second [W is the bandwidth of the envelope s(t)]. Then the likelihood ratio, conditioned on $\check{g}_k$ and $\tilde{g}_k$, may be written as in Eq. (4.26) ($\check{g}_k$ and $\tilde{g}_k$ are the values of the unknown parameters $\check{g}$ and $\tilde{g}$ during the kth observation).

$$\ell(X_k \mid \check{g}_k, \tilde{g}_k) = \ell(\check{X}_k, \tilde{X}_k \mid \check{g}_k, \tilde{g}_k) = \frac{p(\check{X}_k, \tilde{X}_k \mid \check{g}_k, \tilde{g}_k, H_1)}{p(\check{X}_k, \tilde{X}_k \mid \check{g}_k, \tilde{g}_k, H_2)}$$    (4.26)


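But $\check{X}_k$ and $\tilde{X}_k$ are independent when $H_i$ is true, $\check{g}_k$ and $\tilde{g}_k$ are independent, $\check{X}_k$ does not depend on $\tilde{g}_k$, and $\tilde{X}_k$ does not depend on $\check{g}_k$; therefore we have

$$\ell(\check{X}_k, \tilde{X}_k \mid \check{g}_k, \tilde{g}_k) = \frac{p(\check{X}_k \mid \check{g}_k, H_1)\, p(\tilde{X}_k \mid \tilde{g}_k, H_1)}{p(\check{X}_k \mid \check{g}_k, H_2)\, p(\tilde{X}_k \mid \tilde{g}_k, H_2)} = \ell(\check{X}_k \mid \check{g}_k)\, \ell(\tilde{X}_k \mid \tilde{g}_k)$$    (4.27)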


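where, due to the normality of the noise,

$$\ell(\check{X}_k \mid \check{g}_k) = \exp\left[ -\frac{\check{g}_k^2}{2 N_0 W}\, S^t S + \frac{\check{g}_k}{N_0 W}\, S^t \check{X}_k \right]$$

with

$$\check{X}_k = \begin{bmatrix} \check{x}_k(0) \\ \check{x}_k(1/2W) \\ \vdots \\ \check{x}_k(T - 1/2W) \end{bmatrix}, \quad \tilde{X}_k = \begin{bmatrix} \tilde{x}_k(0) \\ \tilde{x}_k(1/2W) \\ \vdots \\ \tilde{x}_k(T - 1/2W) \end{bmatrix}, \quad S = \begin{bmatrix} s(0) \\ s(1/2W) \\ \vdots \\ s(T - 1/2W) \end{bmatrix}$$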
The fact that the likelihood ratio factors into a part depending on $\check{g}_k$ and a part depending on $\tilde{g}_k$ will be useful in the synthesis of the system since it will allow synthesis of two independent systems, one to learn $\check{g}_k$ and one to learn $\tilde{g}_k$.
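
The factoring of the conditional likelihood ratio can be verified on a small numerical case. The Python sketch below is illustrative and makes several assumptions that are not in the text: the noise samples are independent with a common variance `var` (standing in for $N_0 W$), the signal samples and gains are arbitrary hypothetical numbers, and the helper names are invented for the example. It compares the closed-form exponential with a direct ratio of gaussian densities, and the joint ratio with the product of the in-phase and quadrature factors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def likelihood_ratio(x, s, gain, noise_var):
    """Closed form l(X | g) for X = g*S + noise versus X = noise, with
    independent zero-mean gaussian noise samples of variance noise_var."""
    return math.exp(-gain ** 2 * dot(s, s) / (2.0 * noise_var)
                    + gain * dot(s, x) / noise_var)


def gaussian_density(x, mean, noise_var):
    """Joint density of independent gaussian samples with the given means."""
    quad = sum((a - m) ** 2 for a, m in zip(x, mean))
    norm = (2.0 * math.pi * noise_var) ** (-len(x) / 2.0)
    return norm * math.exp(-quad / (2.0 * noise_var))


s = [1.0, 0.5, -0.5, -1.0]            # hypothetical samples of s(t)
x_in = [0.9, 0.1, -0.6, -0.7]         # hypothetical in-phase samples
x_quad = [-0.2, 0.3, 0.1, 0.4]        # hypothetical quadrature samples
g_in, g_quad, var = 0.8, -0.3, 0.5    # hypothetical gains and noise variance

closed = likelihood_ratio(x_in, s, g_in, var)
direct = (gaussian_density(x_in, [g_in * v for v in s], var)
          / gaussian_density(x_in, [0.0] * len(s), var))

# Joint ratio over both components versus the product of the two factors.
joint = (gaussian_density(x_in + x_quad,
                          [g_in * v for v in s] + [g_quad * v for v in s], var)
         / gaussian_density(x_in + x_quad, [0.0] * (2 * len(s)), var))
product = closed * likelihood_ratio(x_quad, s, g_quad, var)
```

The agreement of `joint` and `product` is the numerical counterpart of the factoring that permits two separate learning loops.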

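The optimum system computes

$$\ell(X_k \mid \Lambda_{k-1}) = \iint \ell(X_k \mid \check{g}_k, \tilde{g}_k)\, p(\check{g}_k, \tilde{g}_k \mid \Lambda_{k-1})\, d\check{g}_k\, d\tilde{g}_k$$    (4.28)

thus we require $p(\check{g}_k, \tilde{g}_k \mid \Lambda_{k-1})$ in order to synthesize the system. It may be shown that $\check{g}_k$ and $\tilde{g}_k$ are conditionally independent:*

$$p(\check{g}_k, \tilde{g}_k \mid \Lambda_{k-1}) = p(\check{g}_k \mid \check{\Lambda}_{k-1})\, p(\tilde{g}_k \mid \tilde{\Lambda}_{k-1})$$    (4.29)

where $\check{\Lambda}_{k-1} = (\check{X}_1, \check{X}_2, \ldots, \check{X}_{k-1})$ and $\tilde{\Lambda}_{k-1} = (\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_{k-1})$.

Hence the system may be synthesized in the form shown in Fig. 17.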


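* The left-hand element of Eq. (4.29) can be written as

$$p(\check{g}_k, \tilde{g}_k \mid \Lambda_{k-1}) = p(\check{g}_k \mid \tilde{g}_k, \Lambda_{k-1})\, p(\tilde{g}_k \mid \Lambda_{k-1})$$

Since knowing the value of $\Lambda_{k-1}$ is the same as knowing the values of $\check{\Lambda}_{k-1}$ and $\tilde{\Lambda}_{k-1}$, we may replace $p(\check{g}_k \mid \tilde{g}_k, \Lambda_{k-1})$ by $p(\check{g}_k \mid \tilde{g}_k, \check{\Lambda}_{k-1}, \tilde{\Lambda}_{k-1})$. Alternatively,

$$p(\check{g}_k \mid \tilde{g}_k, \check{\Lambda}_{k-1}, \tilde{\Lambda}_{k-1}) = \frac{p(\check{g}_k, \check{\Lambda}_{k-1}, \tilde{g}_k, \tilde{\Lambda}_{k-1})}{p(\check{\Lambda}_{k-1}, \tilde{g}_k, \tilde{\Lambda}_{k-1})}$$

Since both $\check{g}_k$ and $\check{\Lambda}_{k-1}$ are independent of $\tilde{g}_k$ and $\tilde{\Lambda}_{k-1}$, then $p(\check{g}_k \mid \tilde{g}_k, \Lambda_{k-1}) = p(\check{g}_k \mid \check{\Lambda}_{k-1})$ and $p(\tilde{g}_k \mid \Lambda_{k-1}) = p(\tilde{g}_k \mid \tilde{\Lambda}_{k-1})$.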
FIG. 17. LEARNING RECEIVER FOR FADING ON-OFF KEYED SIGNALS. (Panels: a. Fading on-off keyed signals; b. Separator; c. Conditional likelihood computer.)
To determine the block diagram form of the box which computes $p(\check{g}_k \mid \check{\Lambda}_{k-1})$, a model of the time variations of $\check{g}_k$ and $\tilde{g}_k$ is required. Two possibilities will be considered.

Case 1: The first and simpler of the two models involves the assumption that the fading process is a random-walk process; that is, assume that changes in $\check{g}$ and $\tilde{g}$ take place slowly enough so that each new value of either $\check{g}$ or $\tilde{g}$ is a small independent perturbation of the preceding value, so that

$$\check{g}_k = \check{g}_{k-1} + \check{\epsilon}_k$$

$$\tilde{g}_k = \tilde{g}_{k-1} + \tilde{\epsilon}_k$$    (4.30)

where $\check{\epsilon}_k$ is independent of $\check{g}_{k-1}$; $\tilde{\epsilon}_k$ is independent of $\tilde{g}_{k-1}$; and both are distributed according to $p_{\check{\epsilon}}(z) = p_{\tilde{\epsilon}}(z)$.

In this case, the box to compute $p(\check{g}_k \mid \check{\Lambda}_{k-1})$ or $p(\tilde{g}_k \mid \tilde{\Lambda}_{k-1})$ may be realized as shown in Fig. 18a.
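
Under the random-walk model the density of the new gain, given the past observations, is the convolution of the stored density of the previous gain with the density of the perturbation. The Python sketch below illustrates that step on a quantized gain axis. The grid, the perturbation probabilities, the clamping of probability mass at the ends of the grid, and the function name are all assumptions made for the example and are not taken from Fig. 18a.

```python
def random_walk_predict(posterior, step_pmf):
    """Discrete convolution of the stored density with the perturbation law.

    posterior : probabilities of the previous gain on grid points 0..n-1
    step_pmf  : dict {shift in grid steps: probability of that shift}
    Mass that would leave the grid is clamped to the end points.
    """
    n = len(posterior)
    predicted = [0.0] * n
    for i, mass in enumerate(posterior):
        for shift, prob in step_pmf.items():
            j = min(max(i + shift, 0), n - 1)
            predicted[j] += mass * prob
    return predicted


posterior = [0.0, 0.0, 0.1, 0.8, 0.1, 0.0, 0.0]   # hypothetical stored density
step_pmf = {-1: 0.25, 0: 0.5, 1: 0.25}            # hypothetical small perturbation
predicted = random_walk_predict(posterior, step_pmf)
mean_before = sum(i * p for i, p in enumerate(posterior))
mean_after = sum(i * p for i, p in enumerate(predicted))
```

The prediction is broader than the stored density but has the same mean, which reflects the loss of information about the gain between observations.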

Case 2: The second model is more involved, and allows the correlation between present and past values of $\check{g}$ and $\tilde{g}$ to be taken into account by treating the processes as Mth-order Markov variables. To be specific, assume that the correlation of $\check{g}$ or $\tilde{g}$ decreases exponentially with time back MT sec, and then becomes zero.

In this case


. ^  ■  yfe exp  (•  i 


gk  +  2gkgk-lq  + 


v  v  i 


M  1 


+  2gkgk-iq  +  • • ’  +  2gkgk-Mq 


(4.31) 
so that [see Eq. (4.15)]


m 

p(*ki\-i}  =  /•••/  p(«ki«k-i . «k-M}  n  »^k.i|vi)  d*k-i  ••• 

1=1 


dg 


k-M 


M 


n  fp(K-i 


exp 


(4.32) 


The system shown in Fig. 18b will compute this function. In this case it is necessary to utilize M time-varying, linear filters $h_1, \ldots, h_M$. These filters have an impulse response


^(t.gk)  =  exP 


(4.33) 


Such time-varying filters can be realized with a tapped delay line of delay length $2\tau$ (where $\tau$ is the time required to sweep through the range of $\check{g}$). If the range of $\check{g}$ is quantized into Q levels, the filter will require 2Q taps, as shown in Fig. 19.

FIG. 19. TAPPED-DELAY-LINE REALIZATION OF TIME-VARYING LINEAR FILTER. (Legend: tap gains $\exp[-(n q^i/\sigma^2)\delta]$, n = 1, 2, ..., 2Q; $\delta$ = incremental delay = $\tau/Q$.)
b. Frequency-Shift Keyed (FSK) Signals

A somewhat more complicated, and perhaps more useful, example results when the signal model is modified so that the modulation is frequency-shift keying instead of on-off keying. Such a signal model is described below.

(1) Signal. The signal is on continuously; however, it is shifted between two frequencies depending upon whether a mark or a space is being transmitted. This shift occurs at multiples of T. During transmission of a mark, $s_1(t)$ is transmitted; and during a space, $s_2(t)$ is transmitted, where

$$s_1(t) = \mathrm{Re}\{ S \exp(j\omega_1 t) \}, \quad 0 \le t \le T$$

$$s_2(t) = \mathrm{Re}\{ S \exp(j\omega_2 t) \}, \quad 0 \le t \le T$$

We assume that $\omega_1$ and $\omega_2$ are chosen so that the signals are orthogonal over the interval T; i.e.,

$$\int_0^T s_1(t)\, s_2(t)\, dt = 0$$    (4.34)
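
The orthogonality condition (4.34) can be checked numerically for a particular choice of frequencies. The Python sketch below is an illustration under simplifying assumptions: S is taken as 1, so that the signals are unit cosines, and the two frequencies are whole multiples of 1/T. The frequencies, the duration, and the function name are hypothetical.

```python
import math


def inner_product(f1, f2, duration, n_steps=100000):
    """Midpoint-rule approximation of the integral of
    cos(2*pi*f1*t) * cos(2*pi*f2*t) over [0, duration]."""
    dt = duration / n_steps
    total = 0.0
    for i in range(n_steps):
        t = (i + 0.5) * dt
        total += math.cos(2 * math.pi * f1 * t) * math.cos(2 * math.pi * f2 * t) * dt
    return total


T = 0.01                                          # hypothetical duration, seconds
orthogonal = inner_product(1000.0, 1100.0, T)     # both multiples of 1/T = 100 Hz
not_orthogonal = inner_product(1000.0, 1030.0, T) # spacing is not a multiple of 1/T
energy = inner_product(1000.0, 1000.0, T)         # T/2 for a unit cosine
```

The first pair integrates to zero to within rounding error, while the second pair leaves a residual correlation comparable to the signal energy.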

(2) Channel. We make the same assumptions concerning the channel as for the on-off keyed signal. In this case, however, there are two channels of interest, one at $\omega_1$ and the other at $\omega_2$. We assume that the two channels fade independently and that the multiplicative constants $G_1 = g_1 \exp(-j\phi_1)$ and $G_2 = g_2 \exp(-j\phi_2)$ are independent, complex gaussian random processes with symmetric spectral distributions.

(3) Problem Formulation. In this case there are two hypotheses:

$H_1$ = hypothesis that $x(t) = G_1 s_1(t) + n(t)$

$H_2$ = hypothesis that $x(t) = G_2 s_2(t) + n(t)$
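By writing x(t) in the complex modulation form at the two frequencies and taking advantage of (1) the narrowband nature of the processes and (2) the independence of the channels, it is readily shown that the likelihood ratio factors. Similarly, the joint conditional probability density $p(\check{g}_{1,k}, \tilde{g}_{1,k}, \check{g}_{2,k}, \tilde{g}_{2,k} \mid \Lambda_{k-1})$ factors, so that

$$\ell(X_k \mid \Lambda_{k-1}) = \frac{\ell(\check{X}_{1,k} \mid \check{\Lambda}_{1,k-1})\, \ell(\tilde{X}_{1,k} \mid \tilde{\Lambda}_{1,k-1})}{\ell(\check{X}_{2,k} \mid \check{\Lambda}_{2,k-1})\, \ell(\tilde{X}_{2,k} \mid \tilde{\Lambda}_{2,k-1})}$$    (4.35)

where

$$\ell(\check{X}_{1,k} \mid \check{\Lambda}_{1,k-1}) = \int \ell(\check{X}_{1,k} \mid \check{g}_{1,k})\, p(\check{g}_{1,k} \mid \check{\Lambda}_{1,k-1})\, d\check{g}_{1,k}$$

$$\check{X}_{i,k} = \begin{bmatrix} \check{x}_i(0) \\ \check{x}_i(\Delta) \\ \vdots \\ \check{x}_i(T - \Delta) \end{bmatrix}, \quad \tilde{X}_{i,k} = \begin{bmatrix} \tilde{x}_i(0) \\ \tilde{x}_i(\Delta) \\ \vdots \\ \tilde{x}_i(T - \Delta) \end{bmatrix}$$

Here $\check{x}_i(t)$ and $\tilde{x}_i(t)$ are obtained from x(t) by integration against $\cos \omega_i t$ and $\sin \omega_i t$, respectively, as in the on-off keyed case, $\Delta$ is the sampling interval, and where

$$\check{\Lambda}_{i,k} = (\check{X}_{i,1}, \check{X}_{i,2}, \ldots, \check{X}_{i,k})$$

$$\tilde{\Lambda}_{i,k} = (\tilde{X}_{i,1}, \tilde{X}_{i,2}, \ldots, \tilde{X}_{i,k}), \quad i = 1, 2.$$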
Thus the solution to the independent fading-channel problem when FSK modulation is used is the ratio of the outputs of two of the previous systems. (See Fig. 20.)

FIG. 20. LEARNING RECEIVER FOR FADING FREQUENCY-SHIFT KEYED SIGNALS.


2. Frequency-Hopping Signal Reconnaissance Problem

There are many reconnaissance problems in which it is desired to detect the presence of a signal with unknown or randomly time-varying parameters. Such problems are often readily solved by the procedures outlined above. One such problem involves the detection of a frequency-hopping signal embedded in noise. The model for this example follows.

a. Signal

The signal is assumed to be a narrowband signal which may be represented over an interval of duration T by a sample function of a narrowband gaussian random process with center frequency $\omega$, which is an unknown, time-varying parameter. Hence

$$s(t) = a \cos(\omega t + \phi)$$

where a is Rayleigh distributed, $p(a) = (a/A^2) \exp(-a^2/2A^2)$, and $\phi$ is uniformly distributed over the interval from 0 to $2\pi$. The frequency is assumed to change only at integral multiples of the interval T. The probability of a change in frequency is $p \ll 1$, independently of when the last change occurred, and the frequency is equally likely to change to any value within a specified band W.

b. Noise

The noise is normally distributed, with constant spectral density $N_0/2$ over the band W.

c. Problem Formulation

The problem is to examine intervals (of duration T) of the received waveform and to make a signal-presence decision at the end of each interval; hence signal-present and signal-absent hypotheses are defined as in example 1. Because the unknown variable is a scalar with zero-order Markov value dependence and binomial time dependence, the system for detection must take the form of the system of Fig. 15, with $\theta$ replaced by f. To complete the solution, an expression for $\ell(X \mid f)$ is required. This expression is (see Chapter II)


*(x|f)  = 


exp 


2N  W 
o 


+  1 


|xtE(f)|: 


2N  W 
o 


+  1 


(4.36) 


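where

$$X = \begin{bmatrix} x(0) \\ x(\Delta) \\ \vdots \\ x(T - \Delta) \end{bmatrix}, \quad E(f) = \begin{bmatrix} 1 \\ \exp(j2\pi f\Delta) \\ \vdots \\ \exp[j2\pi f(T - \Delta)] \end{bmatrix}$$

x(t) = received signal, $\Delta$ = sampling interval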
The quantity $|X^t E(f)|^2$ is proportional to the periodogram of the input at frequency f, which in turn is closely related to the spectral density of x(t) at f.* From these facts it may be shown that the likelihood computer (which must sweep over the range of f) consists of a time-compressive sweeping spectrum analyzer* followed by an antilog device and an amplifier, as shown in Fig. 21. Here the sweeping analyzer must cover the band W in the time T, and repeat periodically.
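
The sweeping operation can be illustrated with a direct evaluation of $|X^t E(f)|^2$ over a grid of frequencies. The Python sketch below is an illustration only and does not model the time-compressive analyzer. It assumes a noise-free sinusoidal input, a sampling interval and hop frequency chosen so that the frequency grid (spaced by 1/T) contains the true frequency, and hypothetical function names.

```python
import cmath
import math


def steering_vector(freq, n_samples, delta):
    """E(f): samples of exp(j*2*pi*f*t) at t = 0, delta, ..., (n-1)*delta."""
    return [cmath.exp(2j * math.pi * freq * i * delta) for i in range(n_samples)]


def periodogram_statistic(x, freq, delta):
    """|X^t E(f)|^2 for real samples x taken every delta seconds."""
    e = steering_vector(freq, len(x), delta)
    return abs(sum(xi * ei for xi, ei in zip(x, e))) ** 2


delta = 1.0e-4                       # hypothetical sampling interval (10 kHz rate)
n = 200                              # T = n * delta = 20 ms
true_freq = 1500.0                   # hypothetical hop frequency in hertz
x = [math.cos(2 * math.pi * true_freq * i * delta + 0.3) for i in range(n)]

# Sweep f over the band in steps of 1/T and keep the strongest response.
sweep = [50.0 * k for k in range(1, 100)]          # 50 Hz ... 4950 Hz
best_freq = max(sweep, key=lambda f: periodogram_statistic(x, f, delta))
```

The statistic is insensitive to the unknown phase (0.3 rad here), which is why the receiver can be built around a spectrum analyzer rather than a bank of phase-coherent correlators.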


FIG. 21. LIKELIHOOD COMPUTER. (Output: $\ell(X_k \mid f)$.)


A receiver of this nature will optimally detect frequency-hopping signals for which the model proposed is a suitable representation. Although it is more complicated than many receivers, such an adaptive receiver should not be particularly difficult to construct.

D.  SUMMARY  OF  CHAPTER  IV 

In  this  chapter  we  have  investigated  the  learning  problem  in  which 
the  unknown  parameter  is  time  varying.  By  utilizing  two  specific  models 
for  the  way  in  which  the  parameter  may  vary  in  time,  we  have  demonstrated 
that  the  same  techniques  which  are  applicable  to  the  solution  of  learning 
problems  when  the  parameter  is  fixed  are  applicable  when  the  parameter 
varies  in  time.  Furthermore,  through  the  use  of  two  examples,  we  have 
demonstrated  that  the  models  proposed  are  applicable  in  a  variety  of 
physical  situations. 


* For a general discussion of the periodogram see Ref. 25; for a discussion of a time-compressive spectrum analyzer see Ref. 29; and for the relationship between spectral analysis and the periodogram see Ref. 30.
V. SYSTEM REALIZABILITY


The purpose of this chapter is to investigate the physical realizability of the optimum learning systems developed in the previous chapters. The realizability of a system will be defined in terms of the number of elements required to construct the system rather than in terms of the realizability of the individual elements. A system which requires a finite number of perfect elements such as amplifiers, multipliers, adders, storage elements, etc., will be considered to be realizable.

It is important to recall that very few, if any, mathematical models are exact representations of a physical problem, although the models may be accurate enough that the difference between physical and theoretically predicted events cannot be measured. Such models are considered to be adequate representations in an "engineering" sense. It is in this engineering sense that the individual elements of the learning systems are physically realizable, and it is in this sense that we shall demonstrate the realizability of many learning systems.

A.  SYSTEM  MEMORY  CAPACITY 

Learning systems extract and store information from a sequence of observations. They are useful if the information storage required is less than the storage required to store the observation sequence. In the systems developed in Chapters II and IV the system size (number of elements) depends directly on the number of information-storage elements required. From the mathematical description of the systems, it is clear that the information stored is used to compute $p(\theta \mid \Lambda_k)$; thus to investigate system size we investigate the memory capacity $M_c$ required to compute $p(\theta \mid \Lambda_k)$. We define the required $M_c$ of an optimal learning machine as the minimum number of functions $\Phi_i(\Lambda_k)$ of the observation sequence $\Lambda_k$ which must be stored by the learning machine.

In order to investigate the theoretical information-storage capacity required, we shall examine the concept of necessary and sufficient (minimal sufficient) statistics, and the dimensionality of the linear space spanned by these statistics. We shall utilize the definitions of Dynkin [Ref. 31] and Grettenberg [Ref. 32] to prove the following statements:

1. The system which computes $p(\theta \mid \Lambda_k)$ computes a minimal sufficient statistic.

2. No optimal learning system may be constructed with a memory capacity less than the memory capacity required to compute $p(\theta \mid \Lambda_k)$.

3. If the set $\Phi$ of all possible values of the unknown parameter $\theta$ consists of Q points $\theta_1, \theta_2, \ldots, \theta_Q$, the memory capacity of an optimal learning machine is less than or equal to Q-1.

We shall show in Sec. C that in many learning problems a discrete model for $\theta$ exists which is adequate in an engineering sense.

B.  MINIMAL  SUFFICIENT  STATISTICS 

Systems to solve the classification problem when an important parameter is unknown must extract and store certain information from a sequence of observations. The information to be stored is that which will allow the selection of the conditional probability distribution $p(X \mid \theta)$ (from which the observation X was drawn) from a family of distributions indexed by $\theta$. Systems which perform this selection are computing functions of the learning observations which partition the observation space into a set of decision regions. It is well known that certain functions of the learning sequence lead to Bayes decision regions regardless of the loss functions and a priori probabilities [Ref. 33]. Such functions are sufficient to make a minimum average risk decision; hence they are called sufficient statistics.

Some sufficient statistics are more desirable to compute than others because they require the storage of less information. Since the learning problem under study requires a sufficient statistic, it is desirable to choose that one which requires the least information storage. A function within this class is called by Dynkin [Ref. 31] a necessary and sufficient statistic; however, a more descriptive name, which has been used by Grettenberg [Ref. 32], is a "minimal sufficient statistic."
A sufficient statistic, in the above sense, may be defined as follows.

Definition: A statistic T(X) is sufficient for the family $\{p(X \mid \theta) : \theta \in \Phi\}$ if and only if $p(X \mid \theta)$ may be factored as follows

$$p(X \mid \theta) = h(X)\, f(T(X), \theta)$$    (5.1)

where h(X) depends only on X and $f(T, \theta)$ depends on X only through T.

In order to study minimal sufficient statistics, we first define these functions in terms of functional dependence as below.

Definition: A sufficient statistic $T_1(X)$ is dependent on another sufficient statistic $T_2(X)$ if $T_2(X_1) = T_2(X_2)$ implies $T_1(X_1) = T_1(X_2)$, that is, if $T_1(X)$ may be written as a function which depends on X only through $T_2(X)$.

Definition: A minimal sufficient statistic T(X) is a sufficient statistic which depends on all other sufficient statistics.

From these definitions it is clear that the function $p(\theta \mid \Lambda_k)$ is a minimal sufficient statistic for the family $\{p(X \mid \theta) : \theta \in \Phi\}$ and a sample of size k.* That is, it is sufficient because $p(X_1, \ldots, X_k \mid \theta)$ may be factored as

$$p(X_1, \ldots, X_k \mid \theta) = \frac{p(\theta \mid \Lambda_k)}{p_0(\theta)}\, p(X_1, \ldots, X_k)$$    (5.2)

and it is minimal since it depends on every other sufficient statistic. To show this, let $T(X_1, \ldots, X_k)$ be a sufficient statistic; then

$$p(X_1, \ldots, X_k \mid \theta) = h(X_1, \ldots, X_k)\, f(T(X_1, \ldots, X_k), \theta)$$    (5.3)

* The concept of sufficient statistics is only interesting when we observe more than one sample. We may define minimal sufficient statistics for the sample of size n by replacing T(X) by $T(X_1, X_2, \ldots, X_n)$ in each of the definitions given.
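so that

$$p(\theta \mid \Lambda_k) = \frac{p(X_1, \ldots, X_k \mid \theta)\, p_0(\theta)}{\int p(X_1, \ldots, X_k \mid \theta)\, p_0(\theta)\, d\theta} = \frac{h(X_1, \ldots, X_k)\, f(T(X_1, \ldots, X_k), \theta)\, p_0(\theta)}{h(X_1, \ldots, X_k) \int f(T(X_1, \ldots, X_k), \theta)\, p_0(\theta)\, d\theta}$$    (5.4)

Hence $p(\theta \mid \Lambda_k)$ depends on $(X_1, \ldots, X_k)$ only through $T(X_1, \ldots, X_k)$.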
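
The statement that the a posteriori density depends on the observations only through a sufficient statistic can be illustrated with a simple family. The Python sketch below is an illustration chosen for its brevity and is not one of the report's examples: it assumes independent binary observations with $P(X = 1 \mid \theta) = \theta$, a three-point parameter set, and a uniform prior, all of which are hypothetical. For this family the number of ones in the sample is a sufficient statistic.

```python
def posterior(observations, thetas, prior):
    """p(theta | observations) for independent binary observations with
    P(X = 1 | theta) = theta, evaluated on a finite set of theta values."""
    weights = []
    for theta, p0 in zip(thetas, prior):
        lik = 1.0
        for x in observations:
            lik *= theta if x == 1 else 1.0 - theta
        weights.append(lik * p0)
    total = sum(weights)
    return [w / total for w in weights]


thetas = [0.2, 0.5, 0.8]                 # hypothetical parameter values
prior = [1 / 3, 1 / 3, 1 / 3]
sample_a = [1, 1, 0, 1, 0]               # T = number of ones = 3
sample_b = [0, 1, 1, 0, 1]               # different order, same T
sample_c = [1, 1, 1, 1, 0]               # T = 4

post_a = posterior(sample_a, thetas, prior)
post_b = posterior(sample_b, thetas, prior)
post_c = posterior(sample_c, thetas, prior)
```

Samples a and b give the same a posteriori probabilities because they share the same value of T, while sample c, with a different T, does not.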

Thus the optimal learning system computes a minimal sufficient statistic, and the first statement of Sec. A has been demonstrated. To demonstrate the second statement we proceed as follows.

An optimal learning machine for the observation sequence $\Lambda_k$ and the family $\{p(X \mid \theta) : \theta \in \Phi\}$ must compute a sufficient statistic of $\Lambda_k$. A minimal sufficient statistic of $\Lambda_k$ is a many-one transformation on all other sufficient statistics (except other minimal sufficient statistics) because it is functionally dependent on all other sufficient statistics. Thus the number of functionally independent functions of $\Lambda_k$ which must be computed to compute a minimal sufficient statistic must be minimal. No optimal learning machine can be constructed with a memory capacity less than that of a machine which computes a minimal sufficient statistic.

Finally, we shall demonstrate that the memory capacity $M_c$ of a machine to compute $p(\theta \mid \Lambda_k)$ is finite whenever the set $\Phi$ of all possible values of $\theta$ consists of Q points $\theta_1, \theta_2, \ldots, \theta_Q$, and that $M_c \le Q-1$. The function $p(\theta \mid \Lambda_k)$ may be written

$$p(\theta \mid \Lambda_k) = \sum_{i=1}^{Q} I_i(\theta)\, g_i(\Lambda_k)$$    (5.5)

where

$$I_i(\theta) = \begin{cases} 1 & \theta = \theta_i \\ 0 & \text{elsewhere} \end{cases}$$

$$g_i(\Lambda_k) = p(\theta_i \mid \Lambda_k)$$

But

$$p(\theta_i \mid \Lambda_k) \propto \prod_{j=1}^{k} p(X_j \mid \theta_i)\, p_0(\theta_i)$$    (5.6)

Thus it is sufficient to store the Q functions

$$\sum_{j=1}^{k} \ln p(X_j \mid \theta_i)$$

in order to be able to compute $p(\theta \mid \Lambda_k)$, and $M_c \le Q-1$ (because the Q values $p(\theta_i \mid \Lambda_k)$ sum to one, any Q-1 of them determine the remaining one).

If we have the case where the functions $p(X \mid \theta_i)$ are functionally independent, then it is necessary to store the Q functions, and the inequality ($M_c \le Q-1$) becomes an equality ($M_c = Q-1$).

It is clear that once we are given a decision problem involving $\{p(X \mid \theta) : \theta \in \Phi\}$, we may readily construct a finite-sized system so long as $\Phi$ is a set of Q points. In fact, by taking advantage of any functional dependence which may exist between the functions $p(X \mid \theta_i)$, we may always construct a system which requires a minimum of information storage capacity.
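
The finite-memory argument can be made concrete with a machine that keeps one running sum of log-likelihoods per parameter value. The Python sketch below is an illustration under assumptions that are not in the text: the observations are gaussian with unit variance and a mean equal to the parameter, the parameter set has Q = 3 points, and the class and function names are hypothetical. It also shows that the Q-1 differences of the stored sums are enough to rebuild the a posteriori probabilities.

```python
import math


class FiniteLearner:
    """Stores one running sum of log-likelihoods per parameter value."""

    def __init__(self, log_prior):
        self.sums = list(log_prior)          # Q stored numbers

    def observe(self, log_liks):
        """log_liks[i] = ln p(X_k | theta_i) for the new observation."""
        self.sums = [s + l for s, l in zip(self.sums, log_liks)]

    def posterior(self):
        peak = max(self.sums)                # subtract the peak for stability
        weights = [math.exp(s - peak) for s in self.sums]
        total = sum(weights)
        return [w / total for w in weights]

    def reduced_memory(self):
        """Q-1 differences relative to the first sum; a common offset
        cancels in the normalization, so these determine the posterior."""
        return [s - self.sums[0] for s in self.sums[1:]]


def gaussian_log_lik(x, mean, var=1.0):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


means = [-1.0, 0.0, 1.0]                     # hypothetical theta_1..theta_Q, Q = 3
learner = FiniteLearner([math.log(1 / 3)] * 3)
for x in [0.9, 1.2, 0.7, 1.1]:               # hypothetical observations
    learner.observe([gaussian_log_lik(x, m) for m in means])

post = learner.posterior()
diffs = learner.reduced_memory()
rebuilt = [1.0] + [math.exp(d) for d in diffs]
rebuilt = [r / sum(rebuilt) for r in rebuilt]
```

The storage does not grow with the number of observations; only the values of the Q (or Q-1) stored numbers change.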
The problems in which $\Phi$ does not consist of a discrete set of points may often lead to systems in which $M_c$ is not finite. Thus the systems will not be realizable.

C.  PRACTICAL  CONSIDERATIONS 

We have just noted that the memory size of the learning system is finite so long as the unknown parameter space is discrete, and that in many cases of interest it is infinite when the parameter space is not discrete. Since it is not usually considered possible to construct systems with infinite memory capacity, we may draw the conclusion that we cannot construct the theoretically optimum system in these cases and can then set about either changing the theoretical model, or looking for a suboptimum finite system.

One reasonable way in which to modify the model is to ask for the optimum (Bayes) system under a finite memory constraint. Such an approach, although logical, is difficult to apply to the learning problem and will not be attempted in this study.

Instead of attempting to modify the model we may find it more useful to examine the results of simply using the model to synthesize optimum systems, and then to approximate these systems as well as we can. Although this approach is much less pleasing mathematically, it has the advantage of being practical, and has some precedent in other applied decision-theory fields.

A similar situation exists whenever we represent a continuous function x(t) in the interval (0,T) by its sample values $x(0), x(t_1), \ldots, x(t_m)$ taken in this interval. In an engineering sense, for some large m these samples adequately specify $x(t) : t \in (0,T)$; however, strictly speaking, unless $m \to \infty$ this is only an approximation [see Ref. 20].

The fact that in most cases there is some finite set of discrete values of the unknown parameter which in an engineering sense represent all of the usefully distinguishable values that the parameter may assume is stated in the following theorem. A system based on this set of possible parameter values requires a finite (fixed) memory capacity.
Theorem 1. Designate by $\Phi$ the space of all possible values of the vector parameter $\theta$, and let the range of each coordinate of $\theta$ be bounded. Then if $p(X \mid \theta, H_2)$ is independent of $\theta$, and $p(X \mid \theta, H_1)$ is a continuous function of $\theta$ for all $\theta \in \Phi$ and all X, there exists a subset of $\Phi$, say $\Phi_Q = \{\theta_1, \theta_2, \ldots, \theta_Q\}$, with a finite number, Q, of discrete values of $\theta$ such that for any $\epsilon > 0$ and all $\theta \in \Phi$ there is a $\theta_q \in \Phi_Q$ which satisfies

$$\rho(d^*(\theta_q) \mid \theta) \le \rho(d^*(\theta) \mid \theta) + \epsilon$$

where $\rho(d^*(\theta_q) \mid \theta)$ is the average risk of the Bayes decision rule $d^*(\theta_q)$, based on the assumption that $\theta_q$ is true, when $\theta$ is true.



This theorem is proven in Appendix B. The condition that $p(X \mid \theta, H_1)$ be a continuous function of $\theta$ is not particularly restrictive and could be removed by first extracting the set of $\theta$ at which any discontinuities exist, provided this set is finite. The condition that $p(X \mid \theta, H_2)$ be independent of $\theta$ has been introduced primarily to simplify the proof of the theorem. In most applications this condition will be met. If it is not, it can be replaced by the requirement that $p(X \mid \theta, H_2)$ be a continuous function of $\theta$. These conditions are all usually met in applications, so that a finite set $\Phi_Q$ will almost always exist in practice.

This theorem demonstrates that in many binary decision problems we may quantize the space of the unknown parameter in such a way that if a learning system is constructed on the basis of this quantization, and if the learning system converges so that it utilizes $d^*(\theta_q)$, the ultimate system performance will be arbitrarily close to the ultimate system performance of a system based on the unquantized space. In Chapter VI we shall demonstrate that in most binary decision problems the "quantized" system will converge to $d^*(\theta_q)$ such that

$$\rho(d^*(\theta_q) \mid \theta) = \min_{\theta_i \in \Phi_Q} \rho(d^*(\theta_i) \mid \theta)$$
In order to illustrate the choice of quantization coarseness, consider
the following example.

Example: Suppose that we wish to detect an unknown signal in gaussian
noise. Let

    $\hat S$ = signal vector
    $K$ = noise covariance matrix
    $\hat S$ = unknown parameter

Then we know [see Ref. 3 or 23] that the quality of performance of a
system which is given a priori knowledge of $\hat S$ is dependent on the
"divergence" defined by

$$\hat S^t K^{-1}\hat S$$

The quality of performance of any other linear system using a slightly
mismatched filter can be measured by the ratio of divergences.
The difference in performance of two systems is a continuous monotonic
function of this ratio, and is zero when the ratio is 1. If we require
that

$$1 - (\text{ratio of divergences}) < \epsilon \qquad (5.7)$$

where $\epsilon$ is a small quantity, the performance difference will be small.
Thus if we have a system in which the S-space is quantized so that the
nearest point to $\hat S$ is, say, $S^* = \hat S + \Delta$, then [Ref. 3, p. 45] the ratio of divergences is

$$\frac{(\hat S^tK^{-1}S^*)^2}{(\hat S^tK^{-1}\hat S)(S^{*t}K^{-1}S^*)}
= \frac{(\hat S^tK^{-1}\hat S)^2 + 2(\hat S^tK^{-1}\hat S)(\hat S^tK^{-1}\Delta) + (\hat S^tK^{-1}\Delta)^2}{(\hat S^tK^{-1}\hat S)(\hat S^tK^{-1}\hat S + 2\hat S^tK^{-1}\Delta + \Delta^tK^{-1}\Delta)} \qquad (5.8)$$
Relations (5.7) and (5.8) require that

$$(\hat S^tK^{-1}\hat S)(S^{*t}K^{-1}S^*) - (\hat S^tK^{-1}S^*)^2 < \epsilon\,(\hat S^tK^{-1}\hat S)(S^{*t}K^{-1}S^*) \qquad (5.9a)$$

This relation may be simplified, if $(\hat S^tK^{-1}\hat S)(S^{*t}K^{-1}S^*)$ is greater than
one, to yield

$$(\hat S^tK^{-1}\hat S)(\Delta^tK^{-1}\Delta) - (\hat S^tK^{-1}\Delta)^2 < \epsilon \qquad (5.9b)$$



We may use any quantization interval in S-space which satisfies (5.9a)
or (5.9b) as appropriate. The resulting system will be capable of
performing nearly as well in the steady state as a system with a priori
knowledge of $\hat S$.
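The quantization test of this example can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names (`solve`, `quad`, `divergence_ratio`, `quantization_ok`) are hypothetical, the covariance matrix is assumed to be small and nonsingular, and the ratio is taken to be $(\hat S^tK^{-1}S^*)^2 / [(\hat S^tK^{-1}\hat S)(S^{*t}K^{-1}S^*)]$, as implied by relation (5.9a).

```python
def solve(K, b):
    """Solve K x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [list(map(float, K[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x


def quad(u, K, v):
    """Return the bilinear form u^t K^{-1} v."""
    w = solve(K, v)
    return sum(ui * wi for ui, wi in zip(u, w))


def divergence_ratio(S, S_star, K):
    """Ratio of divergences for a filter matched to S_star when S is true."""
    return quad(S, K, S_star) ** 2 / (quad(S, K, S) * quad(S_star, K, S_star))


def quantization_ok(S, delta, K, eps):
    """True if the grid point S + delta satisfies 1 - ratio < eps."""
    S_star = [s + d for s, d in zip(S, delta)]
    return 1.0 - divergence_ratio(S, S_star, K) < eps
```

For instance, with white noise (identity `K`), `quantization_ok([1, 0], [0, 0.1], K, 0.05)` accepts a small offset, while an offset as large as the signal itself is rejected.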


D.  SUMMARY  OF  CHAPTER  V 

In  this  chapter  we  have  discussed  the  question  of  system  realizability 
in  terms  of  the  number  of  information  storage  elements  required  of  an 
optimal  learning  system.  We  have  been  able  to  prove  two  important  facts 
about  learning  problems. 

1.  Learning  systems  to  solve  problems  in  which  the  unknown  parameter 
may  take  on  only  a  finite  number  of  values  are  always  finite  in 
size. 

2.  Most  learning  problems  in  which  the  unknown  parameter  may  take  on 
an  infinite  number  of  values  may  be  adequately  represented  by 
problems  in  which  the  number  of  values  is  finite. 

In  the  second  statement  an  adequate  representation  is  one  which  leads  to 
a  system  which  will  perform  almost  as  well  as  the  system  based  on  the 
infinite  model.  The  second  statement  depends  upon  the  fact  that  the 
system  based  on  the  finite  model  will  converge  even  though  the  infinite 
model  is  the  best  representation  of  the  physical  problem. 

In  the  second  statement  the  existence  of  an  adequate  representation 
means  that  for  every  possible  value  of  the  unknown  parameter  in  the 
infinite set there is a value in a finite set which is arbitrarily "close"
when "distance" is measured in terms of the difference in performance of
the  corresponding  systems.  This  latter  statement  will  become  particularly 
meaningful  in  the  next  chapter  when  we  show  that  learning  systems  based  on 
the  finite-set  representation  will  converge  to  the  finite  system  which  is 
"closest"  in  a  performance  sense  to  the  optimum  system  based  on  the 
infinite  set. 
VI. SOME PROPERTIES OF LEARNING SYSTEMS

Systems  which  have  been  synthesized  as  proposed  in  Chapters  II  and 
IV  have  several  interesting  properties.  Such  systems  are  stable,  and 
they converge to the system which would be optimum if the unknown parameter
were known. Furthermore, systems which are constructed as suggested
in  Chapter  V,  by  quantization  of  the  unknown  parameter,  also  converge  to 
the  discrete  point  in  the  quantized  space  which  is  nearest  the  convergence 
point  of  the  equivalent  nonquantized  system.  A  most  interesting  property 
of  the  recursive  expressions  developed  in  Chapters  II  and  IV  is  the  fact 
that  in  addition  to  being  applicable  to  the  problems  of  those  chapters  they 
are  also  generally  applicable  to  problems  in  which  learning  with  a  teacher 
is  possible  and  to  problems  in  which  no  learning  is  possible. 

It  is  the  purpose  of  this  chapter  to  formalize  the  statement  of 
these  properties,  and  to  specify  the  conditions  under  which  they  hold. 

For  convenience,  we  shall  carry  out  the  following  discussion  in  terms  of 
the binary decision problem since with a few obvious changes the
discussion would apply equally well to the more general solution.

A.  SYSTEM  STABILITY 

Because the system requires both delay feedback and feedforward loops,
the question arises whether or not there is an input sequence which can
cause an output which will be unbounded. Although we cannot answer this
stability question in the normal control-system manner, we can provide a
satisfactory answer in probabilistic terms; that is, we can show that the
probability that the output will grow without bound is zero. We can
obtain this answer by showing that the sequence of outputs $\ell(X_k\mid\Lambda_{k-1})$
is a bounded martingale (Appendix C) when $\ell(X\mid\theta)$ is bounded for all
$\theta$ and fixed $X$. Since bounded martingales have the property that they
are bounded for all sequences with probability one [Ref. 31],
we will have answered the stability question if we can show that $\ell(X\mid\theta)$
is bounded. But certainly this must be true unless the signal is
"perfectly detectable," and this is a pathological situation which seems to
occur only in textbooks.
As an example of the boundedness of $\ell(X\mid\theta)$ consider example 1 of
Chapter II. In this case we identify $\theta$ with the unknown amplitude, $c$.

$$\ell(X\mid c) = \exp\left(-\tfrac{1}{2}c^2 B^tK^{-1}B + c\,X^tK^{-1}B\right), \qquad r_1 \le c \le r_2 \qquad (6.1)$$

Given any input vector $X$, this function is certainly bounded by

$$\exp\left(-\tfrac{1}{2}r_1^2 B^tK^{-1}B + r_2\,|X^tK^{-1}B|\right) \qquad (6.2)$$

for all values of $c$.
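The bound of this example is easy to spot-check. The following sketch assumes $0 \le r_1 \le c \le r_2$ and reduces the problem to two scalars, `q` for $B^tK^{-1}B$ and `z` for $X^tK^{-1}B$; the function names are hypothetical and the grid scan is a numerical check, not a proof.

```python
import math


def likelihood_ratio(c, q, z):
    """ell(X|c) = exp(-c^2 q / 2 + c z), with q = B^t K^-1 B and z = X^t K^-1 B."""
    return math.exp(-0.5 * c * c * q + c * z)


def upper_bound(r1, r2, q, z):
    """Bound of the form (6.2); assumes 0 <= r1 <= c <= r2."""
    return math.exp(-0.5 * r1 * r1 * q + r2 * abs(z))


def bound_holds(r1, r2, q, z, steps=200):
    """Scan c over [r1, r2] and confirm the likelihood ratio never exceeds the bound."""
    bound = upper_bound(r1, r2, q, z)
    for i in range(steps + 1):
        c = r1 + (r2 - r1) * i / steps
        if likelihood_ratio(c, q, z) > bound * (1.0 + 1e-12):
            return False
    return True
```

The bound is deliberately loose: it pairs the smallest penalty term with the largest possible correlation term, which is all that the stability argument requires.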

B.  CONVERGENCE  OF  THE  CONTINUOUS  SYSTEM 

In  Chapter  II  we  described  systems  for  the  solution  of  problems  in 
which an important parameter was fixed but unknown. An important property
of such systems is the fact that they converge, so that in a sense
they  "learn"  the  fixed  value  of  the  parameter.  In  Chapter  IV  we  described 
similar  systems  for  similar  problems  in  which  the  difference  was  the  fact 
that  the  parameter  was  time  varying.  Since  the  parameter  varies  with 
time,  we  cannot  discuss  the  steady-state  performance  of  these  systems,  and 
therefore  the  following  discussion  is  applicable  only  to  the  systems  of 
Chapter  II. 

We investigate the convergence by again appealing to the martingale
nature of the output. In Appendix C we show that if a sequence of
functions $\{\phi_k(X_1,\ldots,X_k)\}$ exists such that

$$\lim_{k\to\infty}\phi_k(X_1,\ldots,X_k) = \hat\theta \quad\text{with probability one} \qquad (6.3)$$

then

$$\lim_{k\to\infty}\ell(X_k\mid\Lambda_{k-1}) = \ell(X\mid\hat\theta) \quad\text{with probability one} \qquad (6.4)$$

where $\hat\theta$ is the true value of $\theta$. Thus the system (in the limit)
performs as well as one which was designed with knowledge of the signal.

As an example of a problem in which the sequence $\{\phi_k\}$ exists, we
may consider the problem of detection of an unknown signal in noise.
Consider the linear estimate of $\hat S$ (the signal) given by

$$S_k = \frac{1}{k\,p(H_1)}\sum_{i=1}^{k}X_i \qquad (6.5)$$

The observations may be written as

$$X_i = N_i + \gamma_i\hat S \qquad (6.6)$$

where $\gamma_i = 1$ if the signal is transmitted and $\gamma_i = 0$ if the signal is
not transmitted. Thus the $X_i$ are identically distributed, independent random variables,
and by the strong law of large numbers,

$$\lim_{k\to\infty}\frac{1}{k}\sum_{i=1}^{k}X_i = E\{X_i\} \quad\text{with probability one} \qquad (6.7)$$

But $E\{X_i\} = p(H_1)\hat S$; therefore, the sequence $\{S_k\}$ is an example of
the required sequence $\{\phi_k\}$.
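A short simulation illustrates why a scaled sample mean can serve as the required convergent sequence. This is a sketch under assumptions that are not taken from the report: white Gaussian noise of standard deviation `sigma`, a known prior probability `p1` of signal presence, and the hypothetical function name `estimate_signal`.

```python
import random


def estimate_signal(S, p1, k, sigma=1.0, seed=0):
    """Average k unclassified observations X_i = N_i + gamma_i * S and rescale by 1/p1.

    gamma_i is 1 with probability p1 (signal present) and 0 otherwise, so the
    plain sample mean tends to p1 * S; dividing by p1 recovers S.
    """
    rng = random.Random(seed)
    total = [0.0] * len(S)
    for _ in range(k):
        gamma = 1.0 if rng.random() < p1 else 0.0
        for j, s in enumerate(S):
            total[j] += rng.gauss(0.0, sigma) + gamma * s
    return [t / (k * p1) for t in total]
```

With `S = [1.0, -2.0]` and `p1 = 0.4`, the estimate settles close to `S` after a few tens of thousands of observations, even though no observation is ever labeled.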


C.  CONVERGENCE  OF  THE  QUANTIZED  SYSTEM 

In the previous chapter we pointed out that in many cases the set $\Phi$
of all possible values of the parameter $\theta$ may not consist of a finite
number of discrete points and the optimal learning system may not be
realizable. In these cases under very general conditions (see Theorem 1,
Chapter V) a subset of $\Phi$, say $\Phi_Q$, exists and has the following
properties:

(i) For every $\theta\in\Phi$ and $\epsilon > 0$ there exists a $\theta_q\in\Phi_Q$ such that

$$\rho(d^*(\theta_q)\mid\theta) \le \rho(d^*(\theta)\mid\theta) + \epsilon.$$

(ii) There are only a finite number of points $Q(\epsilon)$ in $\Phi_Q$ for any
$\epsilon > 0$.


In this section we shall determine a sufficient condition that a
system based on $\Phi_Q$ will converge in the following sense:

$$\lim_{k\to\infty}\rho_Q(\Lambda_k\mid\hat\theta) = \rho(d^*(\theta_q)\mid\hat\theta) \qquad (6.8)$$

with probability one, where $\rho_Q(\Lambda_k\mid\hat\theta)$ = average risk of system based on
$\Phi_Q$ after $k$ observations when $\hat\theta$ is the true value of $\theta$, and†

$$\rho(d^*(\theta_q)\mid\hat\theta) = \min_{\theta_i\in\Phi_Q}\rho(d^*(\theta_i)\mid\hat\theta) \qquad (6.9)$$

†As previously defined, $\hat\theta$ is the true value of $\theta$, and $\rho(d^*(\theta)\mid\hat\theta)$
is the average risk of the Bayes decision rule based on the assumption that
$\theta$ is true when $\hat\theta$ is actually true.

We shall show that this condition is met for most binary learning
problems. Thus we will demonstrate that the suboptimum system is realizable
and has a performance which is arbitrarily close to the performance of
the optimum (unrealizable) system.

In order to determine a sufficient condition for convergence, we first
note that the system based on $\Phi_Q$ computes the functions

$$\ell(X\mid\theta_j), \qquad j = 1,2,\ldots,Q$$
and

$$P(\theta_j\mid\Lambda_k) = \frac{P_0(\theta_j)\prod_{i=1}^{k}p(X_i\mid\theta_j)}{\sum_{n=1}^{Q}P_0(\theta_n)\prod_{i=1}^{k}p(X_i\mid\theta_n)} \qquad (6.10)$$

The system takes the sum of the products of $\ell(X\mid\theta_j)$ and $P(\theta_j\mid\Lambda_k)$ to form
the likelihood ratio:

$$\ell_Q(X\mid\Lambda_k) = \sum_{j=1}^{Q}\ell(X\mid\theta_j)\,P(\theta_j\mid\Lambda_k) \qquad (6.11)$$

The Bayes decision rule based on $\theta_q$ requires a comparison of $\ell(X\mid\theta_q)$
to a threshold. Thus if $P(\theta_q\mid\Lambda_k)$ converges to 1 when $\hat\theta$ is true,
$\ell_Q(X\mid\Lambda_k)$ will converge to $\ell(X\mid\theta_q)$ and the performance of the suboptimum
system will converge to $\rho(d^*(\theta_q)\mid\hat\theta)$. Theorem 2 states that if a
minimum-risk solution exists, the system will converge to this solution.


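Theorem 2: If there is a $\theta_q\in\Phi_Q$ such that

$$\rho(d^*(\theta_q)\mid\hat\theta) = \min_{\theta_j\in\Phi_Q}\rho(d^*(\theta_j)\mid\hat\theta) \qquad (6.12)$$

and if the distribution of the observation under one hypothesis is
independent of the unknown parameter $\theta$ [i.e., $p(X\mid\theta,H_2) = p(X\mid H_2)$],
then

$$\lim_{k\to\infty}P(\theta_q\mid\Lambda_k) = 1 \quad\text{with probability one}$$

$$\lim_{k\to\infty}P(\theta_j\mid\Lambda_k) = 0 \quad\text{with probability one for all } \theta_j\in\Phi_Q,\ \theta_j\ne\theta_q$$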
This theorem is proven in Appendix C. From the proof it is clear that
when a unique minimum does not exist, then

$$\lim_{k\to\infty}\sum_{\theta_q\in\Phi_M}P(\theta_q\mid\Lambda_k) = 1 \quad\text{with probability one}$$

where $\Phi_M$ is the set of all points $\theta_q\in\Phi_Q$ which have minimum average
risk. Since the points are equivalent from a performance standpoint, it
makes no difference in performance whether the system converges so that

$$\ell_Q(X\mid\Lambda_k)\to\ell(X\mid\theta_{q_i}) \quad\text{for some } \theta_{q_i}\in\Phi_M$$

or so that

$$\ell_Q(X\mid\Lambda_k)\to\sum_{\theta_q\in\Phi_M}\ell(X\mid\theta_q)\,P(\theta_q\mid\Lambda_k)$$


Thus we may summarize by stating that if $\theta$ is an important parameter
in the sense that knowing $\theta$ allows the design of a better system, then
a system based on a discrete model for $\Phi$ will be finite, will exist,
and will converge in performance.
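The behavior of the quantized system can be illustrated with a toy simulation. Everything specific here is an assumption made for illustration: a scalar signal in unit-variance Gaussian noise, a uniform prior over the grid, $a = p(H_2)/p(H_1)$, and the hypothetical names `likelihood_ratio`, `quantized_posterior`, and `simulate`. The true amplitude is deliberately not a grid point.

```python
import math
import random


def likelihood_ratio(x, theta):
    """ell(x|theta) for a scalar signal theta in unit-variance Gaussian noise."""
    return math.exp(theta * x - 0.5 * theta * theta)


def quantized_posterior(xs, grid, p1):
    """Recursively update P(theta_j | observations) over a finite grid.

    Each unclassified observation multiplies the weight of theta_j by
    (ell(x|theta_j) + a), where a = p(H2)/p(H1), followed by normalization.
    """
    a = (1.0 - p1) / p1
    post = [1.0 / len(grid)] * len(grid)
    for x in xs:
        weights = [p * (likelihood_ratio(x, th) + a) for p, th in zip(post, grid)]
        total = sum(weights)
        post = [w / total for w in weights]
    return post


def simulate(theta_true, p1, k, seed=0):
    """Draw k observations; the signal is present with probability p1."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) + (theta_true if rng.random() < p1 else 0.0)
            for _ in range(k)]
```

With a true amplitude of 2.1 and the grid `[1.0, 2.0, 3.0, 4.0]`, the posterior mass in this toy case concentrates on 2.0, the grid point closest to the true value, so the quantized system ends up using a single fixed decision rule.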

D.  RELATIONSHIP  BETWEEN  LEARNING  WITH  A  TEACHER  AND  LEARNING  WITHOUT 
A  TEACHER 

A very interesting relationship may be noted by referring back to
Eq. (2.12). By writing the recursive form as a product, we find that

$$p(\theta\mid\Lambda_{k-1}) = p_0(\theta)\prod_{i=1}^{k-1}\frac{\ell(X_i\mid\theta)+a}{\ell(X_i\mid\Lambda_{i-1})+a} \qquad (6.14)$$
Thus Eq. (2.4) may be rewritten as

$$\ell(X_k\mid\Lambda_{k-1}) = \int\ell(X_k\mid\theta)\,p_0(\theta)\prod_{i=1}^{k-1}\left[\frac{\ell(X_i\mid\theta)+a}{\ell(X_i\mid\Lambda_{i-1})+a}\right]d\theta \qquad (6.15)$$



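When a teacher is available, we may choose to train the system on the
subset of the $X_i$ which are known to contain a signal. In this case
$p(H_1) = 1$, $p(H_2) = 0$; hence $a = 0$, and the system computes a simpler
form

$$\ell(X_k\mid\Lambda_{k-1}) = \int\ell(X_k\mid\theta)\,p_0(\theta)\prod_{i=1}^{k-1}\left[\frac{\ell(X_i\mid\theta)}{\ell(X_i\mid\Lambda_{i-1})}\right]d\theta \qquad (6.16)$$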


On the other hand, when we let $a\to\infty$ the system becomes the usual
nonlearning system

$$\ell(X_k\mid\Lambda_{k-1}) = \ell(X_k) = \int\ell(X_k\mid\theta)\,p_0(\theta)\,d\theta \qquad (6.17)$$

This is as it should be, since as $p(H_2)\to 1$, $p(H_1)\to 0$ and we cannot
learn anything from the past.

Thus  Eq.  (2.12)  describes  a  system  applicable  to  all  (parametrically 
expressible)  binary  decision  problems.  It  applies  even  to  those  in 
which  a  learning  sequence  does  not  exist  and  to  those  in  which  a  properly 
classified  sequence  does  exist.  For  this  reason  the  systems  of  Figs.  2 
and  3  may  be  thought  of  as  canonical  decision  systems.  These  figures 
provide  the  engineer  with  an  insight  into  the  relationship  between  the 
solutions to many binary decision problems, just as the tapped-delay-line
canonical form of the linear filter provides an insight into linear
filters.
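The way a single recursion covers all three categories can be seen in a discrete analogue. The sketch below is an illustration under assumptions: the parameter takes finitely many values, `ratios[j]` stands for $\ell(X_i\mid\theta_j)$, and the function names are hypothetical. The normalizing sum equals the predictive likelihood ratio plus `a`, which mirrors the denominator in the product form of the recursion.

```python
def update_posterior(prior, ratios, a):
    """One recursion step: weight each theta_j by (ell_j + a) and renormalize.

    a = 0      corresponds to learning with a teacher (signal known present),
    0 < a < oo corresponds to learning without a teacher,
    a -> oo    corresponds to no learning (the posterior stays at the prior).
    """
    weights = [p * (r + a) for p, r in zip(prior, ratios)]
    total = sum(weights)  # equals (predictive likelihood ratio) + a
    return [w / total for w in weights]


def predictive_ratio(posterior, ratios):
    """Likelihood ratio averaged over the current posterior."""
    return sum(p * r for p, r in zip(posterior, ratios))
```

For a prior of `[0.5, 0.5]` and likelihood ratios `[4.0, 1.0]`, setting `a = 0` gives the ordinary Bayes update `[0.8, 0.2]`, `a = 1` gives the milder update `[5/7, 2/7]`, and a very large `a` leaves the prior essentially unchanged.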
E. SUMMARY OF CHAPTER VI


In  this  chapter  we  have  applied  the  results  of  Appendix  C  and  Refs. 

15,  18,  and  20  to  demonstrate  that  the  learning  systems  are  stochastically 
stable  and  converge,  and  we  have  pointed  out  that  the  proposed  systems 
are generally applicable to the entire parametric class of decision
problems including the "no learning," "learning with a teacher," and "learning
without a teacher" categories of problems.

We  have  also  shown  that  in  the  cases  where  the  unknown  parameter  is 
useful  in  the  sense  that  knowledge  of  the  parameter  makes  it  possible  to 
make  more  accurate  decisions,  a  finite  system  always  exists  and  converges 
in performance to a point arbitrarily close to the performance of a
system with knowledge of the parameter.

Thus  a  system  to  learn  without  a  teacher  which  has,  from  an  engineering 
viewpoint,  all  of  the  properties  of  the  optimum  system  may  be  constructed 
from  a  finite  number  of  elements. 
VII. SUMMARY OF RESULTS AND SUGGESTIONS FOR FUTURE WORK


A.  RESULTS 

The  primary  results  of  this  investigation  have  been  summarized  in 
detail  at  the  end  of  pertinent  chapters,  and  are  briefly  described  in 
the  form  of  the  following  four  major  contributions  of  this  work.  In  the 
first  two  items,  a  statistical  model  has  been  obtained  which  fits  a  large 
class  of  interesting  decision  problems,  and  a  method  has  been  developed 
to  solve  these  problems.  The  third  and  fourth  contributions  have  been 
related  to  the  practicality  of  the  theoretical  systems. 

1. A recursive relation has been developed which describes the
structure of learning systems which are optimum for any length of
learning sequence. The problems which may be solved by such
systems are restricted to the parametric class of decision problems
in which the functional form of the underlying probability measures
is known; however this class includes problems in which the learning
sequence is not previously classified, as well as problems in which
the a priori probability of occurrence of different classes of
observations is unknown.

2.  The  solution  has  been  extended  to  problems  in  which  the  unknown 
parameter  is  a  time-varying  random  variable.  It  has  been  shown 
that  solutions  to  the  time-varying  problem  are  straightforward 
modifications  of  solutions  to  fixed  parameter  problems. 

Thus  we  have  obtained  a  statistical  model  which  fits  a  large  class 
of  interesting  decision  problems  and  have  developed  a  method  to  solve 
these  problems.  The  method  results  in  a  theoretical  and  functional 
description  of  decision  systems  to  solve  the  problems.  Our  third  and 
fourth contributions have been related to the practicality of the
theoretical systems.

3. It has been demonstrated that in the case where the unknown
parameter may take on only a finite number of values, the optimum
learning system requires a finite memory and is therefore
realizable with a finite number of elements.

4. It has also been demonstrated that so long as the underlying
probability measures are either discrete or absolutely continuous in
the observation space, and so long as the Bayes decision rule
depends upon the unknown parameter, a finite-memory suboptimum
system exists which has performance arbitrarily close to the
performance of the optimum system.
B. PROBLEMS FOR ADDITIONAL RESEARCH


There  are  many  interesting  and  important  applications  of  decision 
machines  which  learn,  just  as  there  are  many  important  general  problems 
involving  such  machines.  Some  of  the  more  outstanding  problems  are 
given  below  as  suggested  areas  for  future  research: 

1. It is clear that in many applications the functional form of the
underlying probability measures is unknown, and thus many problems
may not be treated as parametric learning problems. A systematic
technique for the solution of such problems would be extremely
useful, and an investigation of the possibility of treating such
problems by expanding the probability measures in a series of
known functions with unknown parameters and coefficients might
lead to such a technique.

2.  In  this  study  a  finite-memory  system  which  is  optimum  in  an 
engineering  sense  has  been  found  by  approximating  the  space  of 
the  unknown  parameter  with  a  discrete  space.  An  investigation  of 
the structure of the optimum system under a finite-memory
constraint might lead to additional insight into the solution of
learning  problems. 

3.  The  investigation  of  performance  bounds  has  been  incomplete  and 
the  bounds  determined  have  been  undesirably  loose.  This  is  due 
primarily  to  the  fact  that  such  bounds  depend  very  much  on  the 
particular learning problem being solved. It is presently
necessary to apply difficult, time-consuming numerical computation
techniques or to build or simulate the system in order to
determine whether the resulting performance will be acceptable or to
compare  the  optimum  system  with  some  suboptimum  system.  It  seems 
clear  that  a  simpler  procedure  for  obtaining  tighter  bounds  on 
performance  would  be  very  useful. 
APPENDIX A. EVALUATION OF P(B_k)

In order to evaluate $P(B_k)$ we first note that $B_k$ may occur only
when $P(\hat S\mid\Lambda_k) \le 1/2$; therefore $P(B_k) \le \Pr\{P(\hat S\mid\Lambda_k) \le 1/2\}$. To evaluate
this bound we shall determine bounds on the moments of the distribution
of the random variable $P(\hat S\mid\Lambda_k)$ and apply a Tchebysheff type of bound.
Thus in any particular case the resulting bound may be very loose;
however for the example of Chapter III it is clear that the bound is a
useful one.

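Consider the estimate of $\hat S$ given by

$$S_k^* = \sum_{i=1}^{m}S_i\,P(S_i\mid\Lambda_k) \qquad (A.1)$$

Then

$$E\{\|\hat S - S_k^*\|^2\} \le E\{\|\hat S - \phi_k\|^2\} \qquad (A.2)$$

for any other estimate $\phi_k$ based on $\Lambda_k$ because $S_k^*$ is the
least-mean-square error estimate of $\hat S$ based on $\Lambda_k$. In particular, consider the
estimate

$$\phi_k = \frac{1}{k p_1}\sum_{i=1}^{k}X_i \qquad (A.3)$$
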


Now, if

(i) $X_i = \hat S + N_i$ with probability $p_1$, and $X_i = N_i$ with probability $p_2 = 1 - p_1$,

(ii) Signal and noise are independent,

(iii) $E\{S\} = E\{N\} = 0$, and

(iv) The noise is bandlimited and white with variance $\sigma_n^2$,
then

$$E\{(\hat S - \phi_k)^t(\hat S - \phi_k)\} = \frac{\sigma_n^2}{k p_1^2} + \frac{p_1p_2}{k p_1^2}E\{\hat S^t\hat S\} \qquad (A.4)$$

and therefore

$$E\{(\hat S - S_k^*)^t(\hat S - S_k^*)\} \le \frac{\sigma_n^2 + p_1p_2E\{\hat S^t\hat S\}}{k p_1^2} \qquad (A.5)$$



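We may evaluate $E\{(\hat S - S_k^*)^t(\hat S - S_k^*)\}$ as follows. From (A.1) and the
fact that $(S_i^tS_j)/\sigma_n^2 = \delta_{ij}R$ [see text following Eq. (3.6)], we have
that

$$E\{\hat S^tS_k^*\} = E\{\hat S^t\hat S\}\,E\{P(\hat S\mid\Lambda_k)\} \qquad (A.6)$$

and

$$E\{S_k^{*t}S_k^*\} = E\left\{\left[\sum_{i=1}^{m}S_i\,P(S_i\mid\Lambda_k)\right]^t\left[\sum_{i=1}^{m}S_i\,P(S_i\mid\Lambda_k)\right]\right\} \qquad (A.7)$$

Because of the symmetry, $E\{[P(S_i\mid\Lambda_k)]^2\}$ is constant for all $S_i\ne\hat S$,
so that

$$E\{S_k^{*t}S_k^*\} = E\{\hat S^t\hat S\}\left[E\{[P(\hat S\mid\Lambda_k)]^2\} + (m-1)E\{[P(S_i\mid\Lambda_k)]^2\}\right] \qquad (A.8)$$

By factoring and collecting terms we have

$$E\{(\hat S - S_k^*)^t(\hat S - S_k^*)\} = E\{\hat S^t\hat S\}\left[E\{(P_k - 1)^2\} + (m-1)E\{P_{ik}^2\}\right] \qquad (A.9)$$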
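where

$$P_k = P(\hat S\mid\Lambda_k) \qquad (A.10a)$$

$$P_{ik} = P(S_i\mid\Lambda_k), \quad S_i\ne\hat S \qquad (A.10b)$$

Now

$$(m-1)E\{P_{ik}^2\} \ge 0$$

and is small, so that

$$E\{(P_k - 1)^2\} \le \frac{\sigma_n^2 + p_1p_2E\{\hat S^t\hat S\}}{k p_1^2E\{\hat S^t\hat S\}} \qquad (A.11)$$

or

$$E\{(P_k - 1)^2\} \le \frac{1 + p_1p_2R}{k p_1^2R} \qquad (A.12)$$

where

$$R = E\{\hat S^t\hat S\}/\sigma_n^2.$$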


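In order to obtain a bound on $P(B_k)$ we also require the first
moment of $P_k$. To bound this we write

$$\sum_{i=1}^{m}P(S_i\mid\Lambda_k) = 1, \quad\text{therefore}\quad \sum_{S_i\ne\hat S}P_{ik} = 1 - P_k \qquad (A.13)$$

Hence

$$\left[\sum_{S_i\ne\hat S}P_{ik}\right]^2 = (1 - P_k)^2 \qquad (A.14)$$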
But


(A.15)


so that

$$E\{(1 - P_k)^2\} = (m-1)E\{P_{ik}^2\} + (m-1)(m-2)E^2\{P_{ik}\} \qquad (A.16)$$

From Eqs. (A.5) and (A.6) we have

$$E\{(1 - P_k)^2\} \le \frac{1}{k p_1^2R'} - (m-1)E\{P_{ik}^2\} \qquad (A.17)$$

where

$$R' = \frac{R}{1 + p_1p_2R} \qquad (A.18)$$


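so that

$$(m-1)(m-2)E^2\{P_{ik}\} \le \frac{1}{k p_1^2R'} - 2(m-1)E\{P_{ik}^2\} \qquad (A.19)$$

But since

$$\operatorname{var}(P_{ik}) = E\{P_{ik}^2\} - E^2\{P_{ik}\} \ge 0 \qquad (A.20a)$$

$$E\{P_{ik}^2\} \ge E^2\{P_{ik}\} \qquad (A.20b)$$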
we can write

$$m(m-1)E^2\{P_{ik}\} \le \frac{1}{k p_1^2R'} \qquad (A.21)$$

Finally, by utilizing (A.13) we may bound the first moment as

$$E\{P_k\} \ge 1 - \left(\frac{m-1}{m\,k\,p_1^2R'}\right)^{1/2} \qquad (A.22)$$


By utilizing this first-moment bound and the previous bound on the second
moment and applying a Tchebysheff-type bound [Ref. 34, p. 93], we have

$$\Pr\{P_k \le \tfrac{1}{2}\} \le \frac{4}{p_1\left[(p_1kR' + 8) - 4\left(\frac{m-1}{m}kR'\right)^{1/2}\right]} \qquad (A.23)$$

which is valid so long as $k p_1^2R' > 16$. By using (A.18) we have

$$\Pr\{P_k \le \tfrac{1}{2}\} \le \frac{4(1 + p_1p_2R)}{p_1\left[p_1kR + 8(1 + p_1p_2R) - 4\left(\frac{(m-1)kR(1 + p_1p_2R)}{m}\right)^{1/2}\right]} \qquad (A.24)$$

Because $p_1kR \gg 8(1 + p_1p_2R)$ for large $k$, we may take as our bound
for $p_1P(B_k)$,

$$p_1P(B_k) \le \frac{4(1 + p_1p_2R)}{p_1kR - 4\left(\frac{(m-1)kR(1 + p_1p_2R)}{m}\right)^{1/2}} \qquad (A.25)$$
APPENDIX B. PROOF OF THEOREM 1


For  convenience  in  presenting  this  proof,  Theorem  1  of  Chapter  V 
is  repeated  below. 

Theorem 1. Designate by $\Phi$ the space of all possible values of
the vector parameter $\theta$, and let the range of each coordinate of
$\theta$ be bounded. Then if $p(X\mid\theta,H_2)$ is independent of $\theta$, and
$p(X\mid\theta,H_1)$ is a continuous function of $\theta$ for all $\theta\in\Phi$ and all
$X$, there exists a subset of $\Phi$, say $\Phi_Q = \{\theta_1,\theta_2,\ldots,\theta_Q\}$, with
a finite number, $Q$, of discrete values of $\theta$ such that for any
$\epsilon > 0$ and all $\theta\in\Phi$ there is a $\theta_q\in\Phi_Q$ which satisfies

$$\rho(d^*(\theta_q)\mid\theta) \le \rho(d^*(\theta)\mid\theta) + \epsilon$$

where $\rho(d^*(\theta_q)\mid\theta)$ is the average risk of the Bayes decision rule
based on the assumption that $\theta_q$ is true ($d^*(\theta_q)$) when $\theta$ is
true.

Proof. If $p(X\mid\theta,H_1)$ is a continuous function of $\theta$, then so also is
the integral over any range of $X$. That is,

$$\int_R p(X\mid\theta,H_1)\,dX$$

is a continuous function of $\theta$. Therefore, given any $\epsilon' > 0$, there is
a $\delta > 0$ such that if $\theta_i$ and $\theta_j$ both lie within a sphere of radius
$\delta$, then

$$\left|\int_R p(X\mid\theta_i,H_1)\,dX - \int_R p(X\mid\theta_j,H_1)\,dX\right| < \epsilon'$$

We may therefore choose as a possible set $\Phi_Q$ points in $\Phi$
which are distance $\delta$ along some coordinate from an arbitrary point.
Since the range of values of each coordinate is bounded, there will be
only a finite number of points in this set; furthermore for every $\theta$
in $\Phi$ there will be a member, say $\theta_q$, in $\Phi_Q$ such that

$$\left|\int_R p(X\mid\theta,H_1)\,dX - \int_R p(X\mid\theta_q,H_1)\,dX\right| < \epsilon'$$


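The Bayes decision rule based on $\theta_q$, $d^*(\theta_q)$, divides the observation
space into two mutually exclusive regions $R_q$ and $\bar R_q$. If $X\in R_q$,
then decision rule $d^*(\theta_q)$ results in the decision to accept hypothesis
$H_1$. If $X\in\bar R_q$, then the decision is to accept $H_2$. Hence the average
risk $\rho(d^*(\theta_q)\mid\theta_q)$ when $\theta_q$ is true is given by

$$\rho(d^*(\theta_q)\mid\theta_q) = p_1P\{X\in\bar R_q\mid\theta_q,H_1\} + Lp_2P\{X\in R_q\mid\theta_q,H_2\}$$

But $P\{X\in R_q\mid\theta_q,H_2\} = P\{X\in R_q\mid H_2\}$ because the distribution of $X$ is
independent of $\theta$ when $H_2$ is true. Hence we must have

$$p_1P\{X\in\bar R_q\mid\theta,H_1\} + Lp_2P\{X\in R_q\mid H_2\} - p_1\epsilon' < \rho(d^*(\theta_q)\mid\theta_q)$$

and

$$\rho(d^*(\theta_q)\mid\theta_q) < p_1P\{X\in\bar R_q\mid\theta,H_1\} + Lp_2P\{X\in R_q\mid H_2\} + p_1\epsilon'$$

or

$$\left|\rho(d^*(\theta_q)\mid\theta_q) - \rho(d^*(\theta_q)\mid\theta)\right| < p_1\epsilon' \qquad (B.1)$$

Similarly, by starting with $\rho(d^*(\theta)\mid\theta)$ we have

$$\left|\rho(d^*(\theta)\mid\theta) - \rho(d^*(\theta)\mid\theta_q)\right| < p_1\epsilon' \qquad (B.2)$$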
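Because $d^*$ is a Bayes rule, the following must hold:

$$\rho(d^*(\theta_q)\mid\theta_q) \le \rho(d^*(\theta)\mid\theta_q) \qquad (B.3)$$

$$\rho(d^*(\theta)\mid\theta) \le \rho(d^*(\theta_q)\mid\theta) \qquad (B.4)$$

By inserting (B.2) in (B.3) we obtain

$$\rho(d^*(\theta_q)\mid\theta_q) \le \rho(d^*(\theta)\mid\theta) + p_1\epsilon' \qquad (B.5)$$

and inserting (B.1) in (B.4) we obtain

$$\rho(d^*(\theta)\mid\theta) \le \rho(d^*(\theta_q)\mid\theta_q) + p_1\epsilon' \qquad (B.6)$$

so that

$$\left|\rho(d^*(\theta_q)\mid\theta_q) - \rho(d^*(\theta)\mid\theta)\right| \le p_1\epsilon' \qquad (B.7)$$

The combination of (B.1) and (B.7) yields

$$\left|\rho(d^*(\theta_q)\mid\theta) - \rho(d^*(\theta)\mid\theta)\right| \le 2p_1\epsilon'$$

and thus by choosing $\epsilon' < \epsilon/2p_1$ we have proven the theorem.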
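The content of Theorem 1 can be illustrated numerically in the simplest setting. The sketch below assumes a scalar signal of amplitude $\theta > 0$ in unit-variance Gaussian noise, a miss cost of 1, and a false-alarm cost $L$; these choices and the function names (`Q`, `risk`, `worst_excess_risk`) are assumptions made for illustration only. It measures how much risk is lost by using the rule designed for the best grid point instead of the rule designed for the true value.

```python
import math


def Q(x):
    """Gaussian tail probability P{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def risk(theta_design, theta_true, p1=0.5, L=1.0):
    """Average risk of the Bayes rule designed for theta_design when theta_true holds."""
    p2 = 1.0 - p1
    # Decide H1 when ell(x|theta_design) > L*p2/p1, i.e. when x exceeds t.
    t = theta_design / 2.0 + math.log(L * p2 / p1) / theta_design
    miss = 1.0 - Q(t - theta_true)   # signal present but x falls below t
    false_alarm = Q(t)               # noise alone exceeds t
    return p1 * miss + L * p2 * false_alarm


def worst_excess_risk(grid, lo, hi, steps=400):
    """Largest gap between the best grid rule and the matched rule over [lo, hi]."""
    worst = 0.0
    for i in range(steps + 1):
        theta = lo + (hi - lo) * i / steps
        best = min(risk(tq, theta) for tq in grid)
        worst = max(worst, best - risk(theta, theta))
    return worst
```

Refining the grid from `[1.0, 3.0]` to steps of 0.5 over the same interval shrinks the worst-case excess risk, which is the sense in which a finite set of parameter values can be made adequate for any chosen tolerance.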
APPENDIX C. PROOFS OF STABILITY AND CONVERGENCE


1. System Stability

In order to prove that the system is stable we first prove a more
general theorem* regarding a property of the probability measure $P_k(\theta) =
P(\theta\mid X_1,\ldots,X_k)$, which is the cumulative distribution function
corresponding to the density $p(\theta\mid\Lambda_k)$.

Theorem 3. Any sequence $\{g_1, g_2, \ldots, g_{n+1}\}$ such that

$$g_k = \int_\Phi f(\theta)\,dP_k(\theta) \qquad (C.1)\dagger$$

where

$$P_k(\theta) = P(\theta\mid X_1,\ldots,X_k), \qquad 1 \le k \le n+1 \qquad (C.2)$$

is a bounded martingale if

(i) $f(\theta)$ is any nonnegative Lebesgue measurable function,

(ii) $\max_\theta f(\theta) = M < \infty$.

Proof. A martingale is defined [Ref. 35, p. 293] as a sequence of random
variables $\{x_1, x_2, \ldots, x_n, z\}$ such that

(iii) $E\{|z|\} < \infty$,

(iv) $x_j = E\{z\mid w_1,w_2,\ldots,w_j\}$ for some set of random variables $\{w_j\}$.

Thus to prove the martingale property, it is sufficient to prove

a. $E\{|g_{n+1}|\} < \infty$

b. $g_n = E\{g_{n+1}\mid X_1,\ldots,X_n\}$

*This theorem is due to Daly [Ref. 20]; the proof is repeated for
convenience.

†In order to include the case where $P_k(\theta)$ is a step function, the
integral here is meant in the Lebesgue-Stieltjes sense (see, e.g.,
Ref. 1).
First we prove (a). Since $f(\theta)$ is nonnegative and bounded by $M$
on $\Phi$, and $\int dP_{n+1}(\theta) = 1$, then $g_{n+1}$ is nonnegative and bounded by
$M$; i.e.,

$$0 \le g_{n+1} = \int f(\theta)\,dP_{n+1}(\theta) \le \int M\,dP_{n+1}(\theta) = M\int dP_{n+1}(\theta) = M < \infty \qquad (C.3)$$

hence

$$|g_{n+1}| \le M < \infty \qquad (C.4)$$

and

$$E\{|g_{n+1}|\} \le M < \infty \qquad (C.5)$$

Since this is true for all $n$, we also have

$$\lim_{n\to\infty}E\{|g_n|\} \le \lim_{n\to\infty}M = M < \infty \qquad (C.6)$$

This relation will be required in the proof of the boundedness of the
sequence.

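To prove (b) we must show

$$E\left\{\int_\Phi f(\theta)\,p(\theta\mid X_1,\ldots,X_{n+1})\,d\theta \,\Big|\, X_1,\ldots,X_n\right\} = \int_\Phi f(\theta)\,p(\theta\mid X_1,\ldots,X_n)\,d\theta \qquad (C.7)$$

where the expectation is over the space $X_{n+1}$.* We may write

$$E\{g_{n+1}\mid X_1,\ldots,X_n\} = \int\left[\int_\Phi f(\theta)\,p(\theta\mid X_1,\ldots,X_{n+1})\,d\theta\right]p(X_{n+1}\mid X_1,\ldots,X_n)\,dX_{n+1} \qquad (C.8)$$

*In this case, since we are only interested in finite $n$, we need not
contend with step functions, hence we write $p_k(\theta) = dP_k(\theta)/d\theta$ for
easier manipulation.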
Interchanging the order of integration over $\Phi$ and $X$, we have:

$$E\{g_{n+1}\mid X_1,\ldots,X_n\} = \int_\Phi f(\theta)\left[\int p(\theta\mid X_1,\ldots,X_{n+1})\,p(X_{n+1}\mid X_1,\ldots,X_n)\,dX_{n+1}\right]d\theta$$

$$= \int_\Phi f(\theta)\,p(\theta\mid X_1,\ldots,X_n)\,d\theta = g_n \qquad (C.9)$$

Thus the sequence $\{g_n;\ n = 1,2,\ldots\}$ is a martingale. Doob [Ref. 35,
p. 319] shows in theorem VII, 4.1 that if the sequence $\{g_n;\ n \ge 1\}$ is
a martingale, and if $\lim_{n\to\infty}E\{|g_n|\} = M < \infty$, then $\lim_{n\to\infty}g_n = g_\infty$ exists
with probability one. Thus the sequence $\{g_n;\ n \ge 1\}$ does indeed
converge to a limit with probability one.

This theorem is directly applicable to the proof of system stability.
We make the identification $f(\theta) = \ell(X\mid\theta)$. Then $f(\theta)$ will be a
nonnegative Lebesgue measurable function of $\theta$. If in addition $\ell(X\mid\theta)$ is
a bounded function of $\theta$ for all $X$, then the sequence $\{\ell(X_k\mid\Lambda_{k-1})\}$ is
a bounded martingale and $\lim_{k\to\infty}\ell(X_k\mid\Lambda_{k-1}) < \infty$ with probability one.
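The martingale property used in this proof can be verified exactly in a small discrete case. The sketch below is an illustration under assumptions not taken from the report: the parameter takes finitely many values, each observation is a 0/1 variable with success probability $\theta$, and the function names are hypothetical. It computes $g_n$ and the conditional expectation of $g_{n+1}$ by direct enumeration of the next observation.

```python
def posterior(prior, thetas, xs):
    """Posterior over a finite parameter set after 0/1 observations xs."""
    weights = list(prior)
    for x in xs:
        weights = [w * (t if x == 1 else 1.0 - t) for w, t in zip(weights, thetas)]
    total = sum(weights)
    return [w / total for w in weights]


def g(prior, thetas, f, xs):
    """g_n: the posterior average of f(theta) given the observations xs."""
    post = posterior(prior, thetas, xs)
    return sum(f(t) * p for t, p in zip(thetas, post))


def expected_next_g(prior, thetas, f, xs):
    """E{g_{n+1} | x_1..x_n}, averaging over the next observation."""
    post = posterior(prior, thetas, xs)
    p_one = sum(t * p for t, p in zip(thetas, post))  # predictive P(x = 1 | xs)
    return (p_one * g(prior, thetas, f, xs + [1])
            + (1.0 - p_one) * g(prior, thetas, f, xs + [0]))
```

For any bounded nonnegative `f`, the two quantities agree to rounding error, and `g` never leaves the interval between 0 and the maximum of `f`, which is the bounded martingale behavior the theorem asserts.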

2.  Convergence  of  the  Optimal  System 

In  order  to  find  the  limit  to  which  the  system  converges  we  first 
state a theorem due to Braverman [Ref. 15].

Theorem 4. If there exists a sequence of functions $\{\phi_k(X_1,\ldots,X_k)\}$
such that $\lim_{k\to\infty}\phi_k = \hat\theta$ with probability one, where $\hat\theta$ is the true
value of $\theta$, then

$$\lim_{k\to\infty}P(\theta\mid X_1,\ldots,X_k) = \begin{cases}1, & \theta \ge \hat\theta\\ 0, & \theta < \hat\theta\end{cases} \qquad (C.10)$$

By $\theta < \hat\theta$ we mean that every coordinate of $\theta$ is less than every
coordinate of $\hat\theta$ since $\theta$ may be a vector-valued parameter.

Proof. Braverman proves this theorem for the case of learning with a
teacher, drawing on the fact that if the sequence $\{X_j\}$ is known to
arise from a particular class, then

$$g_k = \int g(\theta)\,dP_k(\theta) \qquad (C.11)$$

is a bounded martingale if $g(\theta)$ is bounded and Lebesgue measurable
on $\Phi$.

We have already proven that $g_k$ is a bounded martingale even when
$\{X_j\}$ does not arise from a single class. Thus if we consider the
sequence of functions

$$\{P_k(E_\theta)\} \qquad (C.12)$$

this sequence will be a bounded martingale because it can be written as

$$P_k(E_\theta) = \int I_{E_\theta}\,dP_k(\theta) \qquad (C.13)$$

where

$$I_{E_\theta} = \begin{cases}1, & \theta\in E_\theta\\ 0, & \theta\notin E_\theta\end{cases} \qquad (C.14)$$

is the indicator function of the set $E_\theta$; hence

$$\lim_{k\to\infty}P_k(E_\theta) = P_\infty(E_\theta) \quad\text{with probability one} \qquad (C.15)$$
Loève [Ref. 36] points out that if the sequence $(X_1, X_2, \ldots, X_n, z)$ is a bounded martingale, then $E\{z \mid X_1, \ldots, X_n\}$ converges with probability one to $z$. If we let $z = I_{E_\theta}$, then the sequence $\{P_k(E_\theta)\}$ must converge to either 1 or 0.

The existence of the convergent sequence $\{\phi_k\}$ must imply that $P_k(E_\theta)$ converges to 1 when $\hat\theta$ is contained in $E_\theta$, and converges to 0 when $\hat\theta$ is not in $E_\theta$.

Thus $P_\infty(\theta)$ must be a (multidimensional) step function with a discontinuity at $\hat\theta$ with probability one.

We  may  extend  this  theorem  to  the  following  corollary. 


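Corollary. If there exists a sequence $\{\phi_k(X_1, \ldots, X_k)\}$ such that

$$\lim_{k\to\infty} \phi_k = \hat\theta \quad \text{with probability one} \tag{C.16}$$

then

$$\lim_{k\to\infty} \int_\Phi f(\theta)\, dP_k(\theta) = f(\hat\theta) \quad \text{with probability one} \tag{C.17}$$

if $f(\theta)$ is continuous on $\Phi$.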

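This follows from the above theorem and the fact that

$$\lim_{k\to\infty} \int_\Phi f(\theta)\, dP_k(\theta) = \int_\Phi f(\theta)\, dP_\infty(\theta) \quad \text{with probability one} \tag{C.18}$$

if $f(\theta)$ is continuous on $\Phi$ and $P_k(\theta)$ has bounded variation on $\Phi$. By definition of the Lebesgue-Stieltjes integral, if $P_\infty(\theta)$ is a step function at $\hat\theta$, then

$$\int_\Phi f(\theta)\, dP_\infty(\theta) = f(\hat\theta) \tag{C.19}$$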
Hence if $\ell(X \mid \theta)$ is a continuous function of $\theta$, we have the fact that

$$\lim_{k\to\infty} \int_\Phi \ell(X \mid \theta)\, dP_k(\theta) = \ell(X \mid \hat\theta) \quad \text{with probability one} \tag{C.20}$$

(where $\hat\theta$ is the true value of $\theta$) if the sequence $\{\phi_k\}$ exists.
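The corollary can be illustrated numerically. The following sketch assumes a scalar parameter on (0, 1), binary observations, a flat prior, and idealized data in which the observed fraction of ones equals the true value exactly; all of these are assumptions made for the illustration, and the function name is hypothetical. It approximates $\int f(\theta)\, dP_k(\theta)$ on a grid and compares it with $f(\hat\theta)$ as $k$ grows.

```python
import math

def posterior_expectation(f, successes, trials, grid_size=20001):
    """Grid approximation of the integral of f(theta) dP_k(theta), flat prior, binary data."""
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]  # midpoints avoid log(0)
    log_w = [successes * math.log(t) + (trials - successes) * math.log(1.0 - t)
             for t in thetas]
    top = max(log_w)                       # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return sum(f(t) * w for t, w in zip(thetas, weights)) / total

theta_hat = 0.3                            # assumed true value of the parameter
f = lambda t: t * t                        # any function continuous on the parameter space

errors = []
for k in (20, 200, 2000):
    successes = int(round(theta_hat * k))  # idealized data: observed fraction equals theta_hat
    errors.append(abs(posterior_expectation(f, successes, k) - f(theta_hat)))
```

In this example the entries of `errors` shrink as `k` increases, which is the behavior (C.17) describes for a posterior distribution that approaches a step function at the true value.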

3.  Convergence  of  the  Quantized  System

In  order  to  prove  Theorem  2,  Chapter  V  we  first  determine  a  sufficient 
condition  for  convergence  as  follows. 

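Theorem 5. If

$$E\{\log p(X \mid \theta_q) - \log p(X \mid \theta_j) \mid \theta\} > 0 \tag{C.21}$$

for some $\theta_q \in \Phi_Q$ and every $\theta_j \in \Phi_Q$, $\theta_j \ne \theta_q$, then

$$\lim_{k\to\infty} P(\theta_q \mid A_k) = 1 \quad \text{with probability one} \tag{C.22a}$$

$$\lim_{k\to\infty} P(\theta_j \mid A_k) = 0 \quad \text{with probability one} \tag{C.22b}$$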

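Proof. If

$$E\{\log p(X \mid \theta_q) - \log p(X \mid \theta_j)\} = \rho > 0 \tag{C.23}$$

for all $\theta_j \in \Phi_Q$, $\theta_j \ne \theta_q$, then

$$\lim_{k\to\infty} \sum_{i=1}^{k} \log \frac{p(X_i \mid \theta_q)}{p(X_i \mid \theta_j)} = k\rho \quad \text{with probability one} \tag{C.24}$$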
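That is, for every $\epsilon > 0$, there exists a $k$ such that

$$\Pr\left\{ \mathrm{LUB} \left| \sum_{i=1}^{k} \log \frac{p(X_i \mid \theta_q)}{p(X_i \mid \theta_j)} - k\rho \right| < \epsilon \right\} = 1 \tag{C.25}$$

where LUB means least upper bound. But

$$\mathrm{LUB} \left| \sum_{i=1}^{k} \log \frac{p(X_i \mid \theta_q)}{p(X_i \mid \theta_j)} - k\rho \right| < \epsilon \tag{C.26}$$

implies that

$$\mathrm{LUB} \left| \exp\left[ -\sum_{i=1}^{k} \log \frac{p(X_i \mid \theta_q)}{p(X_i \mid \theta_j)} \right] - \exp(-k\rho) \right| < e^{-k\rho}\,(e^{\epsilon} - 1) \tag{C.27}$$

Therefore, for all $\delta > 0$, there exists a $k$ such that

$$\Pr\left\{ \mathrm{LUB} \left| \prod_{i=1}^{k} \frac{p(X_i \mid \theta_j)}{p(X_i \mid \theta_q)} - e^{-k\rho} \right| < \delta \right\} = 1 \tag{C.28}$$

or, for every $\delta' > 0$ there exists a $k$ such that

$$\Pr\left\{ \mathrm{LUB} \left| \prod_{i=1}^{k} \frac{p(X_i \mid \theta_j)}{p(X_i \mid \theta_q)} \right| < \delta' \right\} = 1 \tag{C.29}$$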
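The function being computed is $P(\theta_j \mid A_k)$, which may be written as below.

$$P(\theta_j \mid A_k) = \frac{\dfrac{p_0(\theta_j)}{p_0(\theta_q)} \displaystyle\prod_{i=1}^{k} \dfrac{p(X_i \mid \theta_j)}{p(X_i \mid \theta_q)}}{1 + \displaystyle\sum_{j' \ne q} \dfrac{p_0(\theta_{j'})}{p_0(\theta_q)} \prod_{i=1}^{k} \dfrac{p(X_i \mid \theta_{j'})}{p(X_i \mid \theta_q)}} \tag{C.30}$$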


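For each $\theta_j \in \Phi_Q$ and each $k$, define $\zeta_{jk}$ such that

$$\mathrm{LUB} \left| \prod_{i=1}^{k} \frac{p(X_i \mid \theta_j)}{p(X_i \mid \theta_q)} \right| < \zeta_{jk} \tag{C.31}$$

then

$$\mathrm{LUB}\, |P(\theta_j \mid A_k)| \le \mathrm{LUB} \left| \frac{p_0(\theta_j)}{p_0(\theta_q)} \prod_{i=1}^{k} \frac{p(X_i \mid \theta_j)}{p(X_i \mid \theta_q)} \right| < \zeta_{jk}\, \frac{p_0(\theta_j)}{p_0(\theta_q)} \tag{C.32}$$

Hence for every $\zeta > 0$ there exists a $k$ such that

$$\Pr\left\{ \mathrm{LUB}\, |P(\theta_j \mid A_k)| < \zeta \right\} = 1 \tag{C.33}$$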

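Similarly,

$$\mathrm{LUB}\, \left| 1 - P(\theta_q \mid A_k) \right| = \mathrm{LUB} \left| 1 - \frac{1}{1 + \displaystyle\sum_{j \ne q} \frac{p_0(\theta_j)}{p_0(\theta_q)} \prod_{i=1}^{k} \frac{p(X_i \mid \theta_j)}{p(X_i \mid \theta_q)}} \right| < \sum_{j \ne q} \zeta_{jk}\, \frac{p_0(\theta_j)}{p_0(\theta_q)} \tag{C.34}$$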
so that by choosing

$$\max_j \zeta_{jk} < \frac{\zeta}{Q - 1}\, \frac{p_0(\theta_q)}{\max_j p_0(\theta_j)}$$

we obtain

$$\mathrm{LUB}\, |1 - P(\theta_q \mid A_k)| < \zeta \tag{C.35}$$

Therefore, for all $\zeta > 0$, there exists a $k$ such that

$$\Pr\left\{ \mathrm{LUB}\, |P(\theta_q \mid A_k) - 1| < \zeta \right\} = 1 \tag{C.36}$$


So that

$$\lim_{k\to\infty} P(\theta_q \mid A_k) = 1 \quad \text{with probability one} \tag{C.37a}$$

$$\lim_{k\to\infty} P(\theta_j \mid A_k) = 0 \quad \text{with probability one} \tag{C.37b}$$

which proves Theorem 5.
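The behavior asserted by Theorem 5 can be observed in a small simulation. The sketch below assumes Gaussian observations with unit variance, three quantization levels, a uniform prior $p_0$, and a fixed random seed; these choices and all names are illustrative assumptions, not part of the report. The posterior over the levels is computed from the same products of likelihoods that appear in (C.30), evaluated in logarithmic form to avoid underflow.

```python
import math
import random

def posterior_over_levels(samples, levels, prior, sigma=1.0):
    """P(theta_j | A_k) over a finite set of levels for Gaussian observations."""
    log_post = []
    for level, p0 in zip(levels, prior):
        # log of p0(theta_j) times the product of p(X_i | theta_j), up to a common constant
        log_lik = sum(-0.5 * ((x - level) / sigma) ** 2 for x in samples)
        log_post.append(math.log(p0) + log_lik)
    top = max(log_post)                    # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)                     # fixed seed so the run is repeatable
levels = [0.0, 1.0, 2.0]                   # assumed quantized parameter set
prior = [1.0 / 3.0] * 3                    # assumed uniform a priori probabilities
true_theta = 0.8                           # true mean, nearest to the level 1.0
samples = [rng.gauss(true_theta, 1.0) for _ in range(400)]

for k in (1, 10, 100, 400):
    print(k, [round(p, 4) for p in posterior_over_levels(samples[:k], levels, prior)])
```

The true mean is not one of the levels here, yet the posterior weight still collects on the level for which condition (C.21) holds, as the theorem states.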

Theorem 2. If there is a $\theta_q \in \Phi_Q$ such that

$$P \;\cdots\; < \min_{\theta_j \in \Phi_Q} \;\cdots \tag{6.12}$$

and if the distribution of the observation under one hypothesis is independent of the unknown parameter $\theta$ [i.e., $p(X \mid \theta, H_2) = p(X \mid H_2)$], then

$$\lim_{k\to\infty} P(\theta_q \mid A_k) = 1 \quad \text{with probability one}$$

$$\lim_{k\to\infty} P(\theta_j \mid A_k) = 0 \quad \text{with probability one for all } \theta_j \in \Phi_Q,\ \theta_j \ne \theta_q \tag{6.13}$$
Proof. Condition (6.12) implies that

$$\int_a^\infty p_q(\ell)\, d\ell > \int_a^\infty p_j(\ell)\, d\ell \tag{C.38}$$

where

$p_q(\ell)$ = probability density of $\ell(X \mid \theta_q)$
$a$ = any real number > 0

To prove this, we observe that (6.12) implies

(C.39)

where $\gamma = L p_2 / p_1$
$p_i$ = a priori probability of $H_i$ being true

By rearranging this inequality, and changing limits, we have

$$p_1 \int_\gamma^\infty [p_q(\ell \mid H_1) - p_j(\ell \mid H_1)]\, d\ell > L p_2 \int_\gamma^\infty [p_q(\ell \mid H_2) - p_j(\ell \mid H_2)]\, d\ell \tag{C.40}$$

for all $\gamma > 0$. Now assume that

$$\int_{a_0}^\infty p_q(\ell)\, d\ell \le \int_{a_0}^\infty p_j(\ell)\, d\ell \tag{C.41}$$
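for some real number $a_0$. Then

$$\int_{a_0}^\infty [p_1 p_q(\ell \mid H_1) + p_2 p_q(\ell \mid H_2)]\, d\ell \le \int_{a_0}^\infty [p_1 p_j(\ell \mid H_1) + p_2 p_j(\ell \mid H_2)]\, d\ell \tag{C.42}$$

or

$$p_1 \int_{a_0}^\infty [p_q(\ell \mid H_1) - p_j(\ell \mid H_1)]\, d\ell \le -p_2 \int_{a_0}^\infty [p_q(\ell \mid H_2) - p_j(\ell \mid H_2)]\, d\ell \tag{C.43}$$

Combining (C.40) and (C.43) yields

$$L p_2 \int_\gamma^\infty [p_q(\ell \mid H_2) - p_j(\ell \mid H_2)]\, d\ell < -p_2 \int_{a_0}^\infty [p_q(\ell \mid H_2) - p_j(\ell \mid H_2)]\, d\ell \tag{C.44}$$

for all $\gamma > 0$. Suppose that $a_0 > 0$; then we can choose $\gamma = a_0$, hence $L p_2 = a_0 p_1$, and (C.44) becomes

$$a_0 p_1 < -p_2$$

This is a contradiction, since the left side is positive and the right side is negative. Hence $a_0$ cannot be positive, and for all positive real numbers $a$, (C.38) must hold.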

Consider the function

$$E\{\log p(X \mid \theta_q) - \log p(X \mid \theta_j) \mid \theta\} = E_{q,j}$$

It may be written as

$$E_{q,j} = E\{\log [a + \ell(X \mid \theta_q)] - \log [a + \ell(X \mid \theta_j)] \mid \theta\} + E\{\log [p_1 p(X \mid \theta_q, H_2)] - \log [p_1 p(X \mid \theta_j, H_2)] \mid \theta\}$$

where $a = p_2 / p_1$. But since $p(X \mid \theta_q, H_2) = p(X \mid \theta_j, H_2) = p(X \mid H_2)$, the second term on the right is zero. The function $\log [a + \ell]$, $\ell \ge 0$, is monotonically increasing and continuous; hence it may be approximated by a sum of simple functions:


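$$\lim_{\substack{N\to\infty \\ \Delta\to 0}} \sum_{i=0}^{N} \beta_i(\ell) = \log [a + \ell] \quad \text{almost everywhere}$$

where

$$\beta_i(\ell) = \begin{cases} \beta_i, & \text{when } \ell > i\Delta \\ 0, & \text{when } \ell < i\Delta \end{cases}$$

$$\beta_i = \log [a + i\Delta] - \log [a + (i-1)\Delta]$$

so that

$$E_{q,j} = \int \log [a + \ell]\, (p_q(\ell) - p_j(\ell))\, d\ell = \lim_{N\to\infty} \int \sum_{i=0}^{N} \beta_i(\ell)\, (p_q(\ell) - p_j(\ell))\, d\ell = \lim_{N\to\infty} \sum_{i=0}^{N} \beta_i \int_{i\Delta}^{\infty} (p_q(\ell) - p_j(\ell))\, d\ell$$

But the sum of a set of positive numbers must be positive so that

$$E\{\log p(X \mid \theta_q) - \log p(X \mid \theta_j)\} > 0$$

By Theorem 5 this implies the convergence of $P(\theta_q \mid A_k)$ to 1, and proves Theorem 2.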